Let's say I have data like:
df = pd.DataFrame(
{
"product": np.random.choice(["p1", "p2", "p3"], 1000),
"sales": np.random.randint(0, 1000, 1000),
"april": np.random.randint(0, 2, 1000),
"march": np.random.randint(0, 2, 1000),
"february": np.random.randint(0, 2, 1000),
}
)
where:
- "product" is product category
- "sales" are total sales for the product
- "april", "march", "february", ... are flags showing if a specific product was on sale for a specific month.
and I have billions of rows like these.
Then I aggregate data like this through SQL:
df.groupby(["april", "march", "february", "product"]).agg({"sales": "sum"})
and pass it to pandas.
Is there a nice way to pivot data, so that I can display:
- product names as rows
- months as columns
- and active sales as values
without explicit iterating through months and joining results one-by-one?
CodePudding user response:
What you can do is melt your DF (turning it long rather than wide) keeping the product and sales as columns row but having your month column broken down into a separate column that you can then use to pivot accordingly, and filter out only for that month where the product was on sale, eg:
pt = (
df.melt(id_vars=['product', 'sales'], var_name='month')
.query('value == 1')
.pivot_table(index='product', columns='month', values='sales', aggfunc='sum')
)
This given the random sample I ended up having gives you:
month april february march
product
p1 83867 76609 80505
p2 74135 67823 64843
p3 94081 84000 107928