Home > Software design >  How to slice very large dataframe by days without walking the entire dataframe each time?
How to slice very large dataframe by days without walking the entire dataframe each time?

Time:11-08

I have two massive datetime indexed and sorted dataframes that I need compare groups from one to groups from another.

start, end = df.index.min(), df.index.max()

for day in pd.date_range(start.date(), end.date() a_day, freq='D'):
     current_df = df[df.index.date == day.date()]
     current_df2= df2[df2.index.date == day.date()]

     do_heavy_lift(df, df2)

I thought the heavy lift would take most of the time but the slicing by date takes > 95% of the time.

Upon reflection, each time I slice a dataframe it is probably walking the entire index.

Is there any way to significantly improve this approach? The indexes are sorted. Can I:

  1. Tell it to stop searching after the end of the current day?
  2. Remember where it was at the end of the previous day and start there for the next day?

You can create a sample dataset using:

import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today   timedelta(7000), freq='H')

np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')

np.random.seed(seed=1345)
data = np.random.randint(1, high=100, size=len(days))
df2 = pd.DataFrame({'test': days, 'col2': data})
df2 = df2.set_index('test')
df.sort_index(inplace=True)
df2.sort_index(inplace=True)

CPU times: user 9min 50s, sys: 4.27 s, total: 9min 54s Wall time: 10min 45s

CodePudding user response:

You can use groupby_apply. Normalize your datetime by keeping the date part.

def do_heavy_lift(df):
    # do stuff here
    return ...

out = pd.concat([df1, df2], axis=1) \
        .groupby(lambda x: x.normalize()) \
        .apply(heavy_lift)

CodePudding user response:

Use DataFrame.resample to group the data daily. Then iterate over groups, instead of explicitly iterating over the dates and slicing the DataFrames each time.

for (date, g1), (date, g2) in zip(df.resample('1D'), df2.resample('1D')):
    do_heavy_lift(g1, g2)
  • Related