I have a pandas DataFrame of events like the following:
date | user | amount |
---|---|---|
2021-01-01 | Adam | 10 |
2021-01-01 | Bernice | 15 |
2021-01-02 | Adam | 5 |
2021-01-06 | Carl | 8 |
And a pandas Series of dates like ["2021-01-03", "2021-01-12"].
I'm trying to collect stats about the events for the 7 days before each date in the dates series. My target output looks like this:
date | unique_users | average_amount | total_amount |
---|---|---|---|
2021-01-03 | 2 | 10 | 30 |
2021-01-12 | 1 | 8 | 8 |
Is there an efficient way to do this in pandas? This is my current solution, but it doesn't leverage vectorized pandas functions, so it's pretty slow:
```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    "date": [dt.date(2021, 1, 1), dt.date(2021, 1, 1), dt.date(2021, 1, 2), dt.date(2021, 1, 6)],
    "user": ["a", "b", "a", "c"],
    "amount": [10, 15, 5, 8],
})
dates = pd.Series([dt.date(2021, 1, 3), dt.date(2021, 1, 12)])

records = []
for date in dates:
    idx = (
        (df["date"] < date)
        & (df["date"] >= date - dt.timedelta(days=7))
    )
    filtered = df.loc[idx]
    record = {
        "date": date,
        "unique_users": filtered["user"].nunique(),
        "average_amount": filtered["amount"].mean(),
        "total_amount": filtered["amount"].sum(),
    }
    records.append(record)

df2 = pd.DataFrame(records)
print(df2)
```
Answer:
Your question looks like an interval search (plus a unique-value count), in which case you can use a pandas IntervalIndex:
```python
df['date'] = pd.to_datetime(df['date'])
end_date = pd.to_datetime(dates)
start_date = end_date - pd.Timedelta(days=7)
intervals = pd.IntervalIndex.from_arrays(start_date, end_date, closed='both')

# Relabel each event with the end date of the window that contains it,
# then aggregate per window end date.
(df.assign(date=end_date[intervals.get_indexer(df.date)].array)
   .groupby('date')
   .agg(unique_users=('user', 'nunique'),
        average_amount=('amount', 'mean'),
        total_amount=('amount', 'sum'))
)
```
```
            unique_users  average_amount  total_amount
date
2021-01-03             2            10.0            30
2021-01-12             1             8.0             8
```
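For reference, a quick sketch (continuing from the code above) of what the intermediate `get_indexer` step produces on the example data:

```python
# Each event date maps to the position of the window that contains it:
# window 0 ends 2021-01-03, window 1 ends 2021-01-12.
intervals.get_indexer(df.date)
# array([0, 0, 0, 1])
```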
You might have to do a bit of filtering on `end_date[intervals.get_indexer(df.date)]` if some event dates fall outside every window (`get_indexer` returns -1 for those); in the example data you shared, all values fall within a window.
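A minimal sketch of that filtering step, continuing from the variables above and assuming you simply want to drop events that no window contains (the name `pos` is just for illustration):

```python
# Position of the containing window for each event; -1 means "no window".
pos = intervals.get_indexer(df["date"])

kept = df.loc[pos != -1]
(kept.assign(date=end_date.to_numpy()[pos[pos != -1]])
     .groupby('date')
     .agg(unique_users=('user', 'nunique'),
          average_amount=('amount', 'mean'),
          total_amount=('amount', 'sum'))
)
```

Note that `closed='both'` includes events that fall exactly on the sampling date, whereas the loop in the question uses a half-open window (`>= date - 7 days`, `< date`); pick whichever boundary convention you intend.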
Answer:
This approach is similar to @sammywemmy's, but it doesn't require the dates you want to sample to be 7 or more days apart (i.e. the sampling windows may overlap); a short sketch at the end of this answer illustrates this.
setup
```python
import pandas as pd

df = pd.DataFrame({
    "date": [pd.Timestamp(2021, 1, 1), pd.Timestamp(2021, 1, 1), pd.Timestamp(2021, 1, 2), pd.Timestamp(2021, 1, 6)],
    "user": ["a", "b", "a", "c"],
    "amount": [10, 15, 5, 8],
})
dates = pd.Series([pd.Timestamp(2021, 1, 3), pd.Timestamp(2021, 1, 12)])
```
solution
This involves piso, a package that provides methods for working with pandas intervals, interval indexes, etc.
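(piso is not part of pandas itself; if you don't already have it, it should be installable from PyPI, e.g. `pip install piso`.)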
```python
import piso

# piso.contains gives one row per window and one column per event date;
# transpose so that rows are the event dates and columns are the windows.
contained = piso.contains(
    pd.IntervalIndex.from_arrays(dates - pd.Timedelta("7d"), dates),
    df["date"],
).transpose()
```
`contained` is indexed by the original dates in `df` and has a column for each interval window:
```
            (2020-12-27, 2021-01-03]  (2021-01-05, 2021-01-12]
2021-01-01                      True                     False
2021-01-01                      True                     False
2021-01-02                      True                     False
2021-01-06                     False                      True
```
We then create a function which is applied to each column:

```python
def calc(col):
    # col is one boolean column of `contained`: it selects the events that
    # fall inside the corresponding window, which are then aggregated.
    filtered = df.loc[col.values]
    return pd.Series({
        "unique_users": filtered["user"].nunique(),
        "average": filtered["amount"].mean(),
        "total": filtered["amount"].sum(),
    })
```
```python
result = contained.apply(calc).transpose()
```

`result` looks like this:
```
                          unique_users  average  total
(2020-12-27, 2021-01-03]           2.0     10.0   30.0
(2021-01-05, 2021-01-12]           1.0      8.0    8.0
```
You can change the index to be the original sample dates with:

```python
result.index = result.index.right
```
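To illustrate the claim above that the sample dates don't need to be 7 or more days apart, here is a small sketch (continuing from the setup and `calc` above) using hypothetical sample dates only 3 days apart, so their windows overlap; the names `overlapping_dates`, `contained2` and `result2` are just for illustration:

```python
# Hypothetical sample dates only 3 days apart, so the 7-day windows overlap.
overlapping_dates = pd.Series([pd.Timestamp(2021, 1, 3), pd.Timestamp(2021, 1, 6)])

contained2 = piso.contains(
    pd.IntervalIndex.from_arrays(overlapping_dates - pd.Timedelta("7d"), overlapping_dates),
    df["date"],
).transpose()

result2 = contained2.apply(calc).transpose()
result2.index = result2.index.right
```

An event that falls in both windows is simply counted in both; the single `get_indexer` lookup in the first answer maps each event to at most one window, so it can't handle this case directly.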