I would like to add a single count column to the dataframe when using dataframe.mean()
or through some other way (results will be scatter plotted).
I can't use .agg(["count","mean"])
because it adds count
and mean
columns for each quantitive metric making the dataframe very large and harder to plot.
import pandas as pd
df = pd.DataFrame(
{
'store':['store_a', 'store_a', 'store_b', 'store_b', 'store_b', 'store_c',],
'location':['North America', 'North America', 'Europe', 'Europe', 'Europe', 'Europe'],
'sales':[10, 5, 20, 20, 20, 10],
'cost':[5, 4, 10, 14, 14, 8],
}
)
df_grouped = df.groupby(["store", "location"]).mean()
df_grouped
Actual:
store location sales cost
store_a North America 7.5 4.5
store_b Europe 20.0 12.66
store_c Europe 10.0 8.0
Expected:
store location count sales cost
store_a North America 2 7.5 4.5
store_b Europe 3 20.0 12.66
store_c Europe 1 10.0 8.0
CodePudding user response:
You could use Named aggregation
here:
df_grouped = df.groupby(['store','location'], as_index=False)\
.agg(count=('location','count'),
sales=('sales','mean'),
cost=('cost','mean'))
print(df_grouped)
store location count sales cost
0 store_a North America 2 7.5 4.500000
1 store_b Europe 3 20.0 12.666667
2 store_c Europe 1 10.0 8.000000
It doesn't really matter which column you choose for "count" (or indeed "size"), both store
or location
will lead to the same column.
If you have numerous columns and don't want to add the aggregations manually, you can use a dictionary comprehension. E.g. something like this:
cols = [col for col in df.columns if col != 'location']
# ['store', 'sales', 'cost']
df_grouped = df.groupby(["store", "location"], as_index=False).agg(**{
(col if col != 'store' else 'count'):
((col,'mean') if col != 'store' else (col,'count')) for col in cols})
# passing:
{'count': ('store', 'count'),
'sales': ('sales', 'mean'),
'cost': ('cost', 'mean')}
print(df_groupby)
store location count sales cost
0 store_a North America 2 7.5 4.500000
1 store_b Europe 3 20.0 12.666667
2 store_c Europe 1 10.0 8.000000
CodePudding user response:
I presume you need the store count. Use agg
df.groupby(['store','location']).agg(**{'sales': ('sales', 'mean'),
'cost': ('cost', 'mean'),
'count': ('store', 'size') })
store location sales cost count
0 store_a North America 7.5 4.500000 2
1 store_b Europe 20.0 12.666667 3
2 store_c Europe 10.0 8.000000 1
With your new requests try:
df.groupby(['store','location']).agg(**{col: (col, 'count') if col in (['store']) else (col, 'mean') for col in df.drop(columns=['location']).columns }).rename(columns={'store':'count'})