I have a pandas DataFrame
, which stores stock price and time, time column's type is pd.datetime
.
here is a demo:
import pandas as pd
df = pd.DataFrame([['2022-09-01 09:33:00', 100.], ['2022-09-01 09:33:14', 101.], ['2022-09-01 09:33:16', 99.4], ['2022-09-01 09:33:30', 100.9]], columns=['time', 'price'])
df['time'] = pd.to_datetime(df['time'])
In [11]: df
Out[11]:
time price
0 2022-09-01 09:33:00 100.0
1 2022-09-01 09:33:14 101.0
2 2022-09-01 09:33:16 99.4
3 2022-09-01 09:33:30 100.9
I want to calculate future return in 15s. (first price after 15 second - current price)
which I want is:
In [13]: df
Out[13]:
time price return
0 2022-09-01 09:33:00 100.0 -0.6 // the future price is 99.4, period is 16s
1 2022-09-01 09:33:14 101.0 -0.1 // the future price is 100.9, period is 16s
2 2022-09-01 09:33:16 99.4 NaN
3 2022-09-01 09:33:30 100.9 NaN
I know df.diff
can get difference in index, is there any good methods can do this?
CodePudding user response:
merge_asof
to the rescue
Subtract a timedelta of 15s
from the right
dataframe then self merge on time
using merge_asof
with direction=forward
which selects the first row in right
dataframe whose on key is greater than or equal to the on key in the left
dataframe then subtract the price
column to calculate the return
df1 = pd.merge_asof(
left=df,
right=df.assign(time=df['time'] - pd.Timedelta('15s')),
on='time', direction='forward', suffixes=['', '_r']
)
df1['return'] = df1.pop('price_r') - df1['price']
Result
time price return
0 2022-09-01 09:33:00 100.0 -0.6
1 2022-09-01 09:33:14 101.0 -0.1
2 2022-09-01 09:33:16 99.4 NaN
3 2022-09-01 09:33:30 100.9 NaN
CodePudding user response:
Please, try this (but I don't believe the output is very meaningful :-( ). Is it what you expected? (I realized this code assigned the return for the previous "15" seconds, not the next "15" seconds. But this is how the return is usually indexed - by the time when it is realized, not when it is still expected for the future).
import numpy as np
import pandas as pd
df = pd.DataFrame([['2022-09-01 09:33:00', 100.], ['2022-09-01 09:33:14', 101.], ['2022-09-01 09:33:16', 99.4], ['2022-09-01 09:33:30', 100.9]], columns=['time', 'price'])
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time').reset_index(drop=True)
df.loc[:, 'return'] = df['price'].diff()
df['time_diff'] = df['time'].diff()
df['15sec_or_more'] = (df['time_diff'] >= np.timedelta64(15, 's'))
for k, i in enumerate(df.index):
if k:
if not df.loc[i,'15sec_or_more']:
temp = df.iloc[k:].loc[:,['return','time_diff']].cumsum(axis=0)
conds = (temp['time_diff'] >= np.timedelta64(15, 's'))
if conds.sum():
true_return_index = conds.idxmax()
df.loc[i, 'return'] = df.loc[true_return_index, 'return']
else:
df.loc[i, 'return'] = np.nan
df = df[['time', 'price' ,'return']]
print(df)