I have a dataframe of purchases:
product_id count timestamp customer_id
1 1 2021-10-04 10:20 a
1 3 2021-10-04 10:21 b
2 4 2021-10-04 10:00 c
1 2 2021-10-03 10:00 c
I use the following groupby and agg to create a report of the sum and mean of the count, and the latest purchase timestamp.
report = (
df.groupby(product_id).agg(
sum=pd.NamedAgg(column="count", aggfunc="sum"),
mean_count=pd.NamedAgg(column="count", aggfunc="mean"),
latest_purchase_time=pd.NamedAgg(column="timestamp", aggfunc="max")
)
)
I want to include in this report the customer_id that corresponds to the latest purchase timestamp. Is there a way to do this?
e.g.:
product_id sum mean_count latest_purchase_timestamp *customer_id*
1 6 2 2021-10-04 10:21 b
2 4 4 2021-10-04 10:00 c
CodePudding user response:
First convert customer_id
to index
, so possible get value by maximal timestamp
by idxmax
:
report = (
df.set_index('customer_id')
.groupby('product_id').agg(
sum=pd.NamedAgg(column="count", aggfunc="sum"),
mean_count=pd.NamedAgg(column="count", aggfunc="mean"),
latest_purchase_time=pd.NamedAgg(column="timestamp", aggfunc="max"),
customer_id=pd.NamedAgg(column="timestamp", aggfunc="idxmax")
)
)
print (report)
sum mean_count latest_purchase_time customer_id
product_id
1 6 2 2021-10-04 10:21:00 b
2 4 4 2021-10-04 10:00:00 c