I am trying to aggregate sales data using Pandas. Each line of the input file has a date, sales, category and date where there can be multiple entries for a category for a date.
import pandas as pd
from datetime import date
df = pd.DataFrame( [
{ 'Date': date(2022,4,1), 'Category': 'Food', 'Amount': 11.0 },
{ 'Date': date(2022,4,1), 'Category': 'Soda', 'Amount': 3.0 },
{ 'Date': date(2022,4,1), 'Category': 'Soda', 'Amount': 2.0 },
{ 'Date': date(2022,4,1), 'Category': 'Food', 'Amount': 13.0 },
{ 'Date': date(2022,4,2), 'Category': 'Candy', 'Amount': 1.0 },
{ 'Date': date(2022,4,2), 'Category': 'Candy', 'Amount': 0.5 },
{ 'Date': date(2022,4,2), 'Category': 'Food', 'Amount': 15.0 },
{ 'Date': date(2022,4,2), 'Category': 'Soda', 'Amount': 2.0 },
{ 'Date': date(2022,4,2), 'Category': 'Soda', 'Amount': 1.0 },
{ 'Date': date(2022,4,3), 'Category': 'Candy', 'Amount': 2.0 },
{ 'Date': date(2022,4,3), 'Category': 'Food', 'Amount': 18.0 },
{ 'Date': date(2022,4,3), 'Category': 'Food', 'Amount': 11.0 },
] )
I can use groupby to sum the entries for a category and I end up with a multi-index on Date and Category:
b = df.groupby(['Date', 'Category']).sum()
print(b)
Amount
Date Category
2022-04-01 Food 24.00
Soda 5.00
2022-04-02 Candy 1.75
Food 15.00
Soda 5.00
2022-04-03 Candy 0.60
Food 11.00
Soda 3.00
How can I transform this so the different categories are columns with the date as the index, something like this:
Food Soda Candy
2022-04-01 24.0 5.0 0.0
2022-04-02 15.0 5.0 1.75
2022-04-03 11.0 3.0 0.6
I've tried pivot tables, crosstabs (xs) and unstacking and can't figure out the right Pandas commands to get there!
CodePudding user response:
Using crosstab
:
import numpy as np
out = (pd.crosstab(df['Date'], df['Category'], df['Amount'], aggfunc=np.sum)
.fillna(0)
)
Output:
Category Candy Food Soda
Date
2022-04-01 0.0 24.0 5.0
2022-04-02 1.5 15.0 3.0
2022-04-03 2.0 29.0 0.0
Modification of your method with unstack
:
out = (df.groupby(['Date', 'Category'])['Amount'].sum()
.unstack(fill_value=0)
)
CodePudding user response:
You can use pd.pivot_table
with sum
as aggfunc
df.pivot_table(index='Date', columns='Category', values='Amount', aggfunc='sum', fill_value=0)
Output:
Category Candy Food Soda
Date
2022-04-01 0.0 24 5
2022-04-02 1.5 15 3
2022-04-03 2.0 29 0
Because you mentioned next to crosstab (mozway's answer) and pivot_table also unstack
, here a way you could do it with that:
df.set_index(['Date', 'Category'],append=True).unstack().groupby('Date').sum()