Is there an efficient way to iterate over Pandas DataFrame chunks?

Time:01-29

I am working with time series data and want to apply a function to each DataFrame chunk over rolling time windows. When I combine rolling() and apply() on a Pandas DataFrame, the function is applied to each column separately within each window rather than to the window as a whole. Here is some example code:

  • Sample data

In:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 
                'B': [2, 4, 6, 8, 10, 12]}, 
                index=pd.date_range('2019-01-01', periods=6, freq='5T'))
print(df)

Out:

                     A   B
2019-01-01 00:00:00  1   2
2019-01-01 00:05:00  2   4
2019-01-01 00:10:00  3   6
2019-01-01 00:15:00  4   8
2019-01-01 00:20:00  5  10
2019-01-01 00:25:00  6  12

  • Output when using the combination of rolling() and apply():

In:

print(df.rolling('15T', min_periods=2).apply(lambda x: x.sum().sum()))

Out:

                        A     B
2019-01-01 00:00:00   NaN   NaN
2019-01-01 00:05:00   3.0   6.0
2019-01-01 00:10:00   6.0  12.0
2019-01-01 00:15:00   9.0  18.0
2019-01-01 00:20:00  12.0  24.0
2019-01-01 00:25:00  15.0  30.0

Desired Out:

2019-01-01 00:00:00     NaN
2019-01-01 00:05:00     9.0
2019-01-01 00:10:00    18.0
2019-01-01 00:15:00    27.0
2019-01-01 00:20:00    36.0
2019-01-01 00:25:00    45.0
Freq: 5T, dtype: float64

Currently I use a for loop to do the job, but I am looking for a more efficient way to handle this operation. I would appreciate a solution within the Pandas framework, or with other libraries.

Note: Please do not take the example function (summation) too literally. Assume that the function of interest has to receive each chunk of the dataset as is, i.e., with no prior column-wise operations.
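For reference, the loop-based baseline mentioned above might look roughly like the sketch below. The original loop is not shown in the post, so this is only an assumption about its shape: `chunk_func` is a placeholder for the real function, and the '5min'/'15min' aliases are used in place of '5T'/'15T'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12]},
                  index=pd.date_range('2019-01-01', periods=6, freq='5min'))

def chunk_func(chunk):
    # Placeholder: stands in for any function that needs the whole chunk.
    return chunk.to_numpy().sum()

window = pd.Timedelta('15min')
min_periods = 2

values = []
for t in df.index:
    # Same bounds as rolling('15min'): left-open, right-closed (t - 15min, t].
    chunk = df.loc[(df.index > t - window) & (df.index <= t)]
    # Mimic min_periods: windows with too few rows produce NaN.
    values.append(chunk_func(chunk) if len(chunk) >= min_periods else np.nan)

loop_result = pd.Series(values, index=df.index, dtype='float64')
print(loop_result)
```

The boolean mask is rebuilt for every timestamp, which is what makes this approach slow on long series.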

Thanks in advance!

CodePudding user response:

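rolling().apply() cannot do this directly: it calls the function once per column and passes each window as a Series, never as a DataFrame, so the result is the same two-column frame shown in the question. To get the whole chunk, iterate over the rolling object instead (supported since pandas 1.1); each iteration yields one window as a DataFrame. Iterating yields every window, including those shorter than min_periods, so check the chunk length yourself.

def custom_func(chunk):
    # chunk is a DataFrame holding every column of one window
    return chunk.sum().sum()

min_periods = 2
pd.Series([custom_func(chunk) if len(chunk) >= min_periods else float('nan')
           for chunk in df.rolling('15T')],
          index=df.index)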

Out: 

2019-01-01 00:00:00     NaN
2019-01-01 00:05:00     9.0
2019-01-01 00:10:00    18.0
2019-01-01 00:15:00    27.0
2019-01-01 00:20:00    36.0
2019-01-01 00:25:00    45.0
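
A quick way to check what rolling().apply() actually hands to the function is to record the argument types. The sketch below is not part of the original post: `spy` is a hypothetical helper, and the '5min'/'15min' aliases are used as equivalents of '5T'/'15T'.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12]},
                  index=pd.date_range('2019-01-01', periods=6, freq='5min'))

seen_types = []

def spy(window):
    # Record the type of the object passed in, then return a scalar
    # as rolling().apply() requires.
    seen_types.append(type(window).__name__)
    return window.sum()

out = df.rolling('15min', min_periods=2).apply(spy)

print(set(seen_types))     # the kinds of object the function received
print(list(out.columns))   # the result keeps one column per input column
```

Every recorded argument is a Series and the result keeps both columns, which is why a function passed to apply() never sees columns A and B together.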

CodePudding user response:

You were very close: compute the rolling sum for each column first, then add the columns together row by row.

Proposed script

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 
                'B': [2, 4, 6, 8, 10, 12]}, 
                index=pd.date_range('2019-01-01', periods=6, freq='5T'))

df = (df.rolling('15T', min_periods=2)
        .apply(lambda x: x.sum())
        .apply(lambda x: x.sum(), axis=1)
        )

print(df)

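Result (note that the first row is 0.0 rather than the desired NaN, because the row-wise sum() skips NaN values by default)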

2019-01-01 00:00:00     0.0
2019-01-01 00:05:00     9.0
2019-01-01 00:10:00    18.0
2019-01-01 00:15:00    27.0
2019-01-01 00:20:00    36.0
2019-01-01 00:25:00    45.0
Freq: 5T, dtype: float64
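
If the NaN in the first row matters, one possible variation is to pass min_count=1 to the row-wise sum so that a row with no valid values stays NaN. This is a sketch rather than the answer's own code: it uses the built-in rolling sum() instead of apply() and the '5min'/'15min' aliases in place of '5T'/'15T'.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12]},
                  index=pd.date_range('2019-01-01', periods=6, freq='5min'))

result = (df.rolling('15min', min_periods=2)
            .sum()                       # per-column rolling sums; first row is all NaN
            .sum(axis=1, min_count=1))   # row total; stays NaN if no valid value exists

print(result)
```

This shortcut only works because a sum can be split into per-column steps. A function that needs the whole chunk at once, as the question's note requires, still has to receive each window as a DataFrame.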