I have a dataframe in which I want to execute a function that checks if the actual value is a relative maximum, and check if the previous ''n'' values are lower than the actual value.
Having a dataframe 'df_data':
temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp)
First I create a function that will check the previous conditions:
def get_max(high, rolling_max, prev,post):
if ((high > prev) & (high>post) & (high>rolling_max)):
return 1
else:
return 0
df_data['rolling_max'] = df_data.high.rolling(n).max().shift()
Then I apply previous condition row wise:
df_data['ismax'] = df_data.apply(lambda x: get_max(df_data['high'], df_data['rolling_max'],df_data['high'].shift(1),df_data['high'].shift(-1)),axis = 1)
The problem is that I have always get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Which comes due to applying the boolean condition from 'get_max' function to a Serie.
I will love to have a vectorized function, not using loops.
CodePudding user response:
Try:
df_data['ismax'] = ((df_data['high'].gt(df_data.high.rolling(n).max().shift())) & (df_data['high'].gt(df_data['high'].shift(1))) & (df_data['high'].gt(df_data['high'].shift(-1)))).astype(int)
CodePudding user response:
The error is occuring because you are sending the entire series (entire column) to your get_max function rather than doing it row-wise. Creating new columns for the shifted "prev" and "post" values and then using df.apply(func, axis = 1)
normally will work fine here.
As you have hinted at, this solution is quite inefficient and looping through every row will become much slower as your dataframe increases in size.
On my computer, the below code posts:
- LIST_MULTIPLIER = 1, Vectorised code: 0.29s, Row-wise code: 0.38s
- LIST_MULTIPLIER = 100, Vectorised code: 0.31s, Row-wise code = 13.27s
In general therefore it is best to avoid using df.apply(..., axis = 1)
as you can almost always get a better solution using logical operators.
import pandas as pd
from datetime import datetime
LIST_MULTIPLIER = 100
ITERATIONS = 100
def get_dataframe():
temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1] * LIST_MULTIPLIER
df = pd.DataFrame(temp_list)
df.columns = ['high']
return df
df_original = get_dataframe()
t1 = datetime.now()
for i in range(ITERATIONS):
df = df_original.copy()
df['rolling_max'] = df.high.rolling(2).max().shift()
df['high_prev'] = df['high'].shift(1)
df['high_post'] = df['high'].shift(-1)
mask_prev = df['high'] > df['high_prev']
mask_post = df['high'] > df['high_post']
mask_rolling = df['high'] > df['rolling_max']
mask_max = mask_prev & mask_post & mask_rolling
df['ismax'] = 0
df.loc[mask_max, 'ismax'] = 1
t2 = datetime.now()
print(f"{t2 - t1}")
df_first_method = df.copy()
t3 = datetime.now()
def get_max_rowwise(row):
if ((row.high > row.high_prev) &
(row.high > row.high_post) &
(row.high > row.rolling_max)):
return 1
else:
return 0
for i in range(ITERATIONS):
df = df_original.copy()
df['rolling_max'] = df.high.rolling(2).max().shift()
df['high_prev'] = df['high'].shift(1)
df['high_post'] = df['high'].shift(-1)
df['ismax'] = df.apply(get_max_rowwise, axis = 1)
t4 = datetime.now()
print(f"{t4 - t3}")
df_second_method = df.copy()