Home > Enterprise >  How to get evenly-spaced data quickly with a MultiIndex in pandas
How to get evenly-spaced data quickly with a MultiIndex in pandas

Time:09-23

I have a dataframe indexed by stock ticker and date that's pretty sparse, something like:

df = pd.DataFrame({
    'ticker': ['SPY', 'GOOGL', 'GOOGL', 'TSLA', 'TSLA'],
    'date': ['2021-01-01', '2021-09-01', '2021-09-21', '2021-09-21', '2021-09-22'],
    'price': [430.0, 2500.0, 2600.0, 700.0, 710.0],
}).astype({'date': 'datetime64[ns]'}).set_index(['ticker', 'date'])
                    price
ticker  date    
SPY     2021-01-01  430.0
GOOGL   2021-09-01  2500.0
        2021-09-21  2600.0
TSLA    2021-09-21  700.0
        2021-09-22  710.0

I want to end up with a dataframe that has the last three days of data, as best we know, e.g.,

want = pd.DataFrame({
    'ticker': ['SPY', 'SPY', 'SPY', 'GOOGL', 'GOOGL', 'GOOGL', 'TSLA', 'TSLA', 'TSLA'],
    'date': ['2021-09-20', '2021-09-21', '2021-09-22', '2021-09-20', '2021-09-21', '2021-09-22', '2021-09-20', '2021-09-21', '2021-09-22'],
    'price': [430.0, 430.0, 430.0, 2500.0, 2600.0, 2600.0, 0.0, 700.0, 710.0],
}).astype({'date': 'datetime64[ns]'}).set_index(['ticker', 'date'])
                    price
ticker  date    
SPY     2021-09-20  430.0
        2021-09-21  430.0
        2021-09-22  430.0
GOOGL   2021-09-20  2500.0
        2021-09-21  2600.0
        2021-09-22  2600.0
TSLA    2021-09-20  0.0
        2021-09-21  700.0
        2021-09-22  710.0

I've figured out a couple of ways to do this, so far I think the clearest is a groupby with a custom apply, i.e.,

OUTPUT_DATES = pd.date_range(
    start=pd.Timestamp.today() - pd.DateOffset(days=2),
    end=pd.Timestamp.today(),
    freq='D')

def LastNDays(df):
  return (
      df
      .reset_index(level=0, drop=True)
      .reindex(OUTPUT_DATES, method='ffill')
      .rename_axis('date')
      .fillna(0))
  
df.groupby(level=0).apply(LastNDays)

And this works. However, it's also really slow for my actual dataset (several hundred thousand data points). I think it's all the reindexing? This seems like a pretty common task for pandas (take some weird stock data, make it conform to data points) so I feel like there's probably a better way to do this but I don't even know what to search for. Any ideas on how to make this faster?

CodePudding user response:

You'll probably see a huge improvement in performance using an asof merge. First create all the rows you need from the cartesian product of unique ticker labels and the last three dates. Then perform the merge to bring over the closest value (same date or in the past), contruct and sort the MultiIndex, and fill missing values with 0.

import pandas as pd

dates = pd.date_range(pd.to_datetime('today').normalize(), freq='-1D', periods=3)
#DatetimeIndex(['2021-09-22', '2021-09-21', '2021-09-20'], dtype='datetime64[ns]', freq='-1D')    

df1 = pd.DataFrame(product(dates, df.index.get_level_values('ticker').unique()),
                   columns=['date', 'ticker'])

result = (pd.merge_asof(df1.sort_values('date'), df.reset_index().sort_values('date'),
                        by='ticker', on='date', direction='backward')
            .set_index(['ticker', 'date'])
            .sort_index()
            .fillna(0, downcast='infer')
         )

print(result)

                   price
ticker date             
GOOGL  2021-09-20   2500
       2021-09-21   2600
       2021-09-22   2600
SPY    2021-09-20    430
       2021-09-21    430
       2021-09-22    430
TSLA   2021-09-20      0
       2021-09-21    700
       2021-09-22    710
  • Related