I'm not familiar enough with Python to understand how I can make a for loop go faster. Here's what I'm trying to do.
Let's say we have the following dataframe of prices.
import pandas as pd
df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})
The goal is to create a new column called updown, which classifies each row as "Up" or "Down" depending on which happens first in the subsequent rows: the price rising by at least 2, or falling by at least 2.
df['updown'] = 0
for i in range(df.shape[0]):
    j = 0
    while df.price.iloc[i + j] < (df.price.iloc[i] + 2) and df.price.iloc[i + j] > (df.price.iloc[i] - 2):
        j = j + 1
    if df.price.iloc[i + j] >= (df.price.iloc[i] + 2):
        df.updown.iloc[i] = "Up"
    if df.price.iloc[i + j] <= (df.price.iloc[i] - 2):
        df.updown.iloc[i] = "Down"
This works just fine, but simply runs too slow when running on millions of rows. Note that I am aware the code throws an error once it gets to the last row, which is fine with me.
Where can I learn how to make something like this run much faster (ideally in seconds, or at least minutes, as opposed to the 10 hours it takes right now)?
CodePudding user response:
df.loc[df['price'] - df['price'].shift(2) >= 2, 'updown'] = 'up'
df.loc[df['price'] - df['price'].shift(2) <= -2, 'updown'] = 'down'
This should do.
CodePudding user response:
There are various errors stopping the example code from working, at least for me. Could you please confirm this is what you want the algorithm to do?
import pandas as pd
df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})
df['updown'] = 0
count = df.shape[0]
for i in range(count):
    j = 1
    up = df.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if (value := df.price.iloc[pos]) >= up:
            df.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
print(df)
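For a quick check of the intended rule without pandas, here is a minimal sketch in plain Python. The function name `first_move` and the `step` parameter are made up for illustration, and it assumes that rows with no later move of at least 2 in either direction should stay unlabeled (`None`).

```python
def first_move(prices, step=2):
    """Label each price by whichever comes first in the later rows:
    a rise of at least `step` ("Up") or a fall of at least `step` ("Down")."""
    labels = []
    for i, base in enumerate(prices):
        label = None  # stays None if neither threshold is ever reached
        for later in prices[i + 1:]:
            if later >= base + step:
                label = "Up"
                break
            if later <= base - step:
                label = "Down"
                break
        labels.append(label)
    return labels


sample = [98, 99, 101, 99, 97, 100, 100, 98]
print(first_move(sample))
# ['Up', 'Up', 'Down', 'Down', 'Up', 'Down', 'Down', None]
```

This is only a reference for checking results on small inputs; it is no faster than the loop above.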
CodePudding user response:
Running through a bunch of different examples, the second method in the following code is approximately 75x faster:
import time
from random import randint

import numpy as np
import pandas as pd

data = [randint(90, 120) for i in range(10000)]

# Method 1: row-by-row loop
df1 = pd.DataFrame({'price': data})
t0 = time.time()
df1['updown'] = np.nan
count = df1.shape[0]
for i in range(count):
    j = 1
    up = df1.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if (value := df1.price.iloc[pos]) >= up:
            df1.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df1.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
t1 = time.time()
print(f'Method 1: {t1 - t0}')
res1 = df1.head()

# Method 2: compare the whole column against itself shifted by i rows
df2 = pd.DataFrame({'price': data})
t2 = time.time()
count = len(df2)
df2['updown'] = np.nan
up = df2.price + 2
down = df2.price - 2
i = 1
# Keep going while any row that can still look i rows ahead is unlabeled
while (i < count) and ((i == 1) or df2.updown[:-(i - 1)].isna().any()):
    shift = df2.price.shift(-i)
    isna = df2.updown.isna()
    df2.loc[isna & (shift >= up), 'updown'] = 'Up'
    df2.loc[isna & (shift <= down), 'updown'] = 'Down'
    i += 1
t3 = time.time()
print(f'Method 2: {t3 - t2}')

# Check that both methods produce the same labels
s1 = df1.updown
s2 = df2.updown
match = (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
print(f'Series match: {match}')
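The idea behind the second method is to compare the whole column against itself shifted by 1, 2, 3, ... rows and to fill in only the rows that are still unlabeled, so the Python-level loop runs once per look-ahead distance instead of once per row. Below is a minimal sketch of that idea wrapped in a function. The name `label_updown`, the `step` parameter, and the object-dtype label column initialised to `None` are my own choices for illustration, not part of the code above.

```python
import pandas as pd


def label_updown(prices, step=2):
    """Label each row "Up" or "Down" by the first later move of at least `step`."""
    price = pd.Series(prices, dtype="float64").reset_index(drop=True)
    labels = pd.Series([None] * len(price), dtype="object")
    up = price + step
    down = price - step
    for i in range(1, len(price)):
        unset = labels.isna()
        # Only rows 0 .. len-i-1 have a value i rows ahead; stop once all are labeled
        if not unset.iloc[:len(price) - i].any():
            break
        shifted = price.shift(-i)  # price i rows later, NaN past the end
        labels[unset & (shifted >= up)] = "Up"
        labels[unset & (shifted <= down)] = "Down"
    return labels


sample = [98, 99, 101, 99, 97, 100, 100, 98]
print(label_updown(sample).tolist())
# ['Up', 'Up', 'Down', 'Down', 'Up', 'Down', 'Down', None]
```

The speed-up depends on the data: the loop stops as soon as every row that can still be resolved has a label, so it helps most when prices usually move by 2 within a few rows. If long flat stretches are common, the number of passes grows toward the number of rows.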