How to efficiently get rolling windows stats from a pandas dataframe for arbitrary dates-CodePudding

I have a pandas Dataframe of events like the following:

date	user	amount
2021-01-01	Adam	10
2021-01-01	Bernice	15
2021-01-02	Adam	5
2021-01-06	Carl	8

And a pandas Series of dates like ["2021-01-03", "2021-01-12"]

I'm trying to collect stats about the events for the 7 days before each date in the dates series. My target output looks like this:

date	unique_users	average_amount	total_amount
2021-01-03	2	10	30
2021-01-12	1	8	8

Is there an efficient way to do this in Pandas? This is my current solution, but it doesn't leverage pandas functions so is pretty slow:

import datetime as dt

import pandas as pd


df = pd.DataFrame({
    "date": [dt.date(2021, 1, 1), dt.date(2021, 1, 1), dt.date(2021, 1, 2), dt.date(2021, 1, 6)],
    "user": ["a", "b", "a", "c"],
    "amount": [10, 15, 5, 8],
    })
dates = pd.Series([dt.date(2021, 1, 3), dt.date(2021, 1, 12)])

records = []
for date in dates:
    idx = (
        (df["date"] < date)
        & (df["date"] >= date - dt.timedelta(days=7))
        )
    filtered = df.loc[idx]
    record = {
        "date": date,
        "unique_users": filtered["user"].nunique(),
        "average_amount": filtered["amount"].mean(),
        "total_amount": filtered["amount"].sum(),
        }
    records.append(record)
df2 = pd.DataFrame(records)
print(df2)

CodePudding user response：

Your question looks like an interval search ( and unique values as well), in which case, you can use pandas interval index :

df['date'] = pd.to_datetime(df['date'])
end_date = pd.to_datetime(dates)
start_date = end_date - pd.Timedelta(days=7)
intervals = pd.IntervalIndex.from_arrays(start_date, end_date, closed='both')
(df.assign(date = end_date[intervals.get_indexer(df.date)].array)
   .groupby('date')
   .agg(unique_users=('user', 'nunique'), 
        average_amount=('amount', 'mean'), 
        total_amount = ('amount', 'sum'))
) 
            unique_users  average_amount  total_amount
date                                                  
2021-01-03             2            10.0            30
2021-01-12             1             8.0             8

You might have to do a bit of filtering with the end_date[intervals.get_indexer(df.date)]; in the example data you shared, all values are present

CodePudding user response：

This approach is similar to @sammywemmy but will doesn't require the dates you want to sample to be 7 days, or more, apart.

setup

df = pd.DataFrame({
    "date": [pd.Timestamp(2021, 1, 1), pd.Timestamp(2021, 1, 1), pd.Timestamp(2021, 1, 2), pd.Timestamp(2021, 1, 6)],
    "user": ["a", "b", "a", "c"],
    "amount": [10, 15, 5, 8],
})
dates = pd.Series([pd.Timestamp(2021, 1, 3), pd.Timestamp(2021, 1, 12)])

solution

Involves piso, a package with methods for pandas intervals, interval index etc

import piso

contained = piso.contains(
    pd.IntervalIndex.from_arrays(dates-pd.Timedelta("7d"), dates),
    df["date"],
).transpose()

contained is indexed by original dates in df and a column for each interval window.

            (2020-12-27, 2021-01-03]  (2021-01-05, 2021-01-12]
2021-01-01                      True                     False
2021-01-01                      True                     False
2021-01-02                      True                     False
2021-01-06                     False                      True

We then create a function which is applied to each column

def calc(col):
    filtered = df.loc[col.values]
    return pd.Series({
        "unique_users":filtered["user"].nunique(),
        "average":filtered["amount"].mean(),
        "total":filtered["amount"].sum()
    })

result = contained.apply(calc).transpose()

result looks like this

                          unique_users  average  total
(2020-12-27, 2021-01-03]           2.0     10.0   30.0
(2021-01-05, 2021-01-12]           1.0      8.0    8.0

You can change the index to be the original sample dates with

result.index = result.index.right