Let's take this example Pandas DataFrame, which has two columns, ['date'] and ['price']: ['date'] is always ascending, ['price'] is random:
import pandas as pd

df = pd.DataFrame({
    'date': ['01/01/2019', '01/02/2019', '01/03/2019', '01/04/2019', '01/05/2019',
             '01/06/2019', '01/07/2019', '01/08/2019', '01/09/2019', '01/10/2019'],
    'price': [10, 2, 5, 4, 12, 8, 9, 19, 12, 3]
})
The goal is to add two more columns: ['next_date'] holds the date of the first later row whose price is greater than the current price, and ['next_price'] holds that price, like this:
date price next_date next_price
0 01/01/2019 10 01/05/2019 12
1 01/02/2019 2 01/03/2019 5
2 01/03/2019 5 01/05/2019 12
3 01/04/2019 4 01/05/2019 12
4 01/05/2019 12 01/08/2019 19
5 01/06/2019 8 01/07/2019 9
6 01/07/2019 9 01/08/2019 19
7 01/08/2019 19 NaN NaN
8 01/09/2019 12 NaN NaN
9 01/10/2019 3 NaN NaN
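One aside on the sample data: ['date'] holds strings, and zero-padded MM/DD/YYYY strings only happen to sort correctly within a single year. If the real data spans multiple years, it is safer to parse the column first; a minimal sketch, assuming the format shown above:

# Optional: parse the string dates so ordering reflects actual dates rather
# than lexicographic string order (assumes the MM/DD/YYYY sample format).
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')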
I've tested some solutions that did what I want, but with very poor performance; the real df has over a million rows.
These are my test solutions:
Using pandasql:
from pandasql import sqldf

result = sqldf("SELECT l.date, l.price, MIN(r.date) AS next_date FROM df AS l LEFT JOIN df AS r ON (r.date > l.date AND r.price > l.price) GROUP BY l.date, l.price ORDER BY l.date")
result = pd.merge(result, df, left_on='next_date', right_on='date', suffixes=('', '_next'), how='left')
print(result)
Using Pandas with SQLite:
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database for the example
df.to_sql('df', conn, index=False)
qry = "SELECT l.date, l.price, MIN(r.date) AS next_date FROM df AS l LEFT JOIN df AS r ON (r.date > l.date AND r.price > l.price) GROUP BY l.date, l.price ORDER BY l.date"
result = pd.read_sql_query(qry, conn)
result = pd.merge(result, df, left_on='next_date', right_on='date', suffixes=('', '_next'), how='left')
print(result)
Using apply:
import numpy as np

def find_next_price(row):
    # Rows that come later and have a strictly higher price
    mask = (df['price'] > row['price']) & (df['date'] > row['date'])
    if len(df[mask]):
        return df[mask]['date'].iloc[0], df[mask]['price'].iloc[0]
    return np.nan, np.nan

df[['next_date', 'next_price']] = list(df.apply(find_next_price, axis=1))
print(df)
Some of these solutions start to fail on a 50,000-row df, while I need to perform this task on a 1,000,000-row df (each approach compares every row against every other row, so the work grows roughly quadratically with the row count).
Note: there is a very similar question here, but its answers also have poor performance: https://stackoverflow.com/questions/72047646/python-pandas-add-column-containing-first-index-where-future-column-value-is-gr
CodePudding user response:
Since you need to perform this task on a large number of rows (1M+), a traditional vectorized approach with numpy might not be feasible, especially with a limited amount of memory. Here I'm presenting a functional approach using basic algorithmic computation; you can compile this function with numba's just-in-time compiler to achieve C-like speeds:
import numba
import numpy as np

@numba.njit
def argmax(price: np.ndarray):
    # For each position i, yield the index of the first later element
    # greater than price[i], or -1 if no such element exists.
    for i in range(len(price)):
        idx = -1
        for j in range(i + 1, len(price)):
            if price[i] < price[j]:
                idx = j
                break
        yield idx
i = np.array(list(argmax(df['price'].values)))
m = i != -1 # index is -1 if there's no next greater price
df.loc[m, 'next_date'] = df['date'].values[i[m]]
df.loc[m, 'next_price'] = df['price'].values[i[m]]
Result
date price next_date next_price
0 01/01/2019 10 01/05/2019 12.0
1 01/02/2019 2 01/03/2019 5.0
2 01/03/2019 5 01/05/2019 12.0
3 01/04/2019 4 01/05/2019 12.0
4 01/05/2019 12 01/08/2019 19.0
5 01/06/2019 8 01/07/2019 9.0
6 01/07/2019 9 01/08/2019 19.0
7 01/08/2019 19 NaN NaN
8 01/09/2019 12 NaN NaN
9 01/10/2019 3 NaN NaN
PS: This solution was tested on 1M rows.
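A caveat on complexity: the inner loop makes the generator O(n²) in the worst case (e.g., strictly decreasing prices), even though it is typically fast when the next greater price is nearby. If the worst case matters, the classic "next greater element" monotonic-stack algorithm gives a guaranteed O(n) pass. A minimal sketch (the helper name next_greater_index is mine, not part of the answer above; plain Python/numpy here, though it could also be compiled with numba using a typed list for the stack):

import numpy as np

def next_greater_index(price: np.ndarray) -> np.ndarray:
    # Monotonic-stack "next greater element": every index is pushed and
    # popped at most once, so the whole pass is O(n).
    out = np.full(len(price), -1, dtype=np.int64)
    stack = []  # indices whose next greater price has not been found yet
    for j in range(len(price)):
        while stack and price[stack[-1]] < price[j]:
            out[stack.pop()] = j
        stack.append(j)
    return out

i = next_greater_index(df['price'].values)
m = i != -1
df.loc[m, 'next_date'] = df['date'].values[i[m]]
df.loc[m, 'next_price'] = df['price'].values[i[m]]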