I have a dataframe indexed by stock ticker and date that's pretty sparse, something like:
import pandas as pd

df = pd.DataFrame({
    'ticker': ['SPY', 'GOOGL', 'GOOGL', 'TSLA', 'TSLA'],
    'date': ['2021-01-01', '2021-09-01', '2021-09-21', '2021-09-21', '2021-09-22'],
    'price': [430.0, 2500.0, 2600.0, 700.0, 710.0],
}).astype({'date': 'datetime64[ns]'}).set_index(['ticker', 'date'])
                    price
ticker date
SPY    2021-01-01   430.0
GOOGL  2021-09-01  2500.0
       2021-09-21  2600.0
TSLA   2021-09-21   700.0
       2021-09-22   710.0
I want to end up with a dataframe that has the last three days of data, as best we know, e.g.,
want = pd.DataFrame({
    'ticker': ['SPY', 'SPY', 'SPY', 'GOOGL', 'GOOGL', 'GOOGL', 'TSLA', 'TSLA', 'TSLA'],
    'date': ['2021-09-20', '2021-09-21', '2021-09-22', '2021-09-20', '2021-09-21', '2021-09-22', '2021-09-20', '2021-09-21', '2021-09-22'],
    'price': [430.0, 430.0, 430.0, 2500.0, 2600.0, 2600.0, 0.0, 700.0, 710.0],
}).astype({'date': 'datetime64[ns]'}).set_index(['ticker', 'date'])
                    price
ticker date
SPY    2021-09-20   430.0
       2021-09-21   430.0
       2021-09-22   430.0
GOOGL  2021-09-20  2500.0
       2021-09-21  2600.0
       2021-09-22  2600.0
TSLA   2021-09-20     0.0
       2021-09-21   700.0
       2021-09-22   710.0
I've figured out a couple of ways to do this; so far I think the clearest is a groupby with a custom apply, i.e.,
# The three most recent calendar days, ending today (normalized to midnight
# so the output index matches the dates in want).
OUTPUT_DATES = pd.date_range(
    start=pd.Timestamp.today().normalize() - pd.DateOffset(days=2),
    end=pd.Timestamp.today().normalize(),
    freq='D')

def LastNDays(df):
    # Forward-fill one ticker's sparse history onto OUTPUT_DATES;
    # tickers with no earlier data get 0.
    return (
        df
        .reset_index(level=0, drop=True)        # drop the ticker level
        .reindex(OUTPUT_DATES, method='ffill')  # as-of forward fill onto the target dates
        .rename_axis('date')
        .fillna(0))

df.groupby(level=0).apply(LastNDays)
And this works. However, it's really slow on my actual dataset (several hundred thousand rows); I suspect it's all the per-group reindexing. This seems like a pretty common task in pandas (take irregularly spaced stock data and conform it to a fixed set of dates), so I feel like there's probably a better way to do this, but I don't even know what to search for. Any ideas on how to make this faster?
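In case it helps to reproduce the slowness, here is a rough synthetic benchmark shaped like my real data (the ticker count and history lengths are made up):

import numpy as np
import timeit

rng = np.random.default_rng(0)
parts = []
for i in range(1000):                        # ~1000 tickers
    n = int(rng.integers(100, 400))          # irregular history length per ticker
    offsets = rng.choice(730, size=n, replace=False)
    parts.append(pd.DataFrame({
        'ticker': f'TICK{i:04d}',
        'date': pd.Timestamp.today().normalize() - pd.to_timedelta(offsets, unit='D'),
        'price': rng.random(n) * 1000.0,
    }))
big = pd.concat(parts).set_index(['ticker', 'date']).sort_index()   # a few hundred thousand rows

print(timeit.timeit(lambda: big.groupby(level=0).apply(LastNDays), number=1))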
CodePudding user response:
You'll probably see a huge improvement in performance using an asof merge (pd.merge_asof). First create all the rows you need from the Cartesian product of the unique ticker labels and the last three dates. Then perform the merge to bring over each ticker's closest value (same date or in the past), construct and sort the MultiIndex, and fill any remaining missing values with 0.
import pandas as pd
from itertools import product

# The last three days, counting backward from today.
dates = pd.date_range(pd.to_datetime('today').normalize(), freq='-1D', periods=3)
#DatetimeIndex(['2021-09-22', '2021-09-21', '2021-09-20'], dtype='datetime64[ns]', freq='-1D')

# Cartesian product: one row per (date, ticker) pair.
df1 = pd.DataFrame(product(dates, df.index.get_level_values('ticker').unique()),
                   columns=['date', 'ticker'])
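# With 'today' taken as 2021-09-22 (as in the comment above), df1 starts out like:
#         date ticker
# 0 2021-09-22    SPY
# 1 2021-09-22  GOOGL
# 2 2021-09-22   TSLA
# 3 2021-09-21    SPY
# ...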
# merge_asof needs both frames sorted on the merge key ('date').
result = (pd.merge_asof(df1.sort_values('date'), df.reset_index().sort_values('date'),
                        by='ticker', on='date', direction='backward')
          .set_index(['ticker', 'date'])
          .sort_index()
          .fillna(0, downcast='infer')
          )
print(result)
                   price
ticker date
GOOGL  2021-09-20   2500
       2021-09-21   2600
       2021-09-22   2600
SPY    2021-09-20    430
       2021-09-21    430
       2021-09-22    430
TSLA   2021-09-20      0
       2021-09-21    700
       2021-09-22    710
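If you need the float dtype from the want frame, or a different window length, the same approach works with a couple of small tweaks (N_DAYS below is just an illustrative name):

N_DAYS = 3   # illustrative knob: how many trailing days to materialize
dates = pd.date_range(end=pd.to_datetime('today').normalize(), periods=N_DAYS, freq='D')

grid = pd.DataFrame(product(dates, df.index.get_level_values('ticker').unique()),
                    columns=['date', 'ticker'])

result = (pd.merge_asof(grid.sort_values('date'), df.reset_index().sort_values('date'),
                        by='ticker', on='date', direction='backward')
          .set_index(['ticker', 'date'])
          .sort_index()
          .fillna(0.0)          # plain fillna keeps the float price column
          )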