Home > Software design >  Add a single count column to a groupby Pandas dataframe while getting the means from the quantity co
Add a single count column to a groupby Pandas dataframe while getting the means from the quantity co

Time:10-16

I would like to add a single count column to the dataframe when using dataframe.mean() or through some other way (results will be scatter plotted).

I can't use .agg(["count","mean"]) because it adds count and mean columns for each quantitive metric making the dataframe very large and harder to plot.

import pandas as pd

df = pd.DataFrame(
  {
    'store':['store_a', 'store_a', 'store_b', 'store_b', 'store_b', 'store_c',],
    'location':['North America', 'North America', 'Europe', 'Europe', 'Europe', 'Europe'],
    'sales':[10, 5, 20, 20, 20, 10],
    'cost':[5, 4, 10, 14, 14, 8],
  }
)
df_grouped = df.groupby(["store", "location"]).mean()
df_grouped

Actual:

store      location         sales   cost
store_a    North America    7.5     4.5
store_b    Europe           20.0    12.66
store_c    Europe           10.0    8.0

Expected:

store      location         count   sales   cost
store_a    North America    2       7.5     4.5
store_b    Europe           3       20.0    12.66
store_c    Europe           1       10.0    8.0

CodePudding user response:

You could use Named aggregation here:

df_grouped = df.groupby(['store','location'], as_index=False)\
    .agg(count=('location','count'), 
         sales=('sales','mean'), 
         cost=('cost','mean'))

print(df_grouped)

     store       location  count  sales       cost
0  store_a  North America      2    7.5   4.500000
1  store_b         Europe      3   20.0  12.666667
2  store_c         Europe      1   10.0   8.000000

It doesn't really matter which column you choose for "count" (or indeed "size"), both store or location will lead to the same column.


If you have numerous columns and don't want to add the aggregations manually, you can use a dictionary comprehension. E.g. something like this:

cols = [col for col in df.columns if col != 'location']
# ['store', 'sales', 'cost']

df_grouped = df.groupby(["store", "location"], as_index=False).agg(**{
    (col if col != 'store' else 'count'): 
        ((col,'mean') if col != 'store' else (col,'count')) for col in cols})
    
# passing:
    {'count': ('store', 'count'),
     'sales': ('sales', 'mean'),
     'cost': ('cost', 'mean')}

print(df_groupby)

     store       location  count  sales       cost
0  store_a  North America      2    7.5   4.500000
1  store_b         Europe      3   20.0  12.666667
2  store_c         Europe      1   10.0   8.000000

CodePudding user response:

I presume you need the store count. Use agg

df.groupby(['store','location']).agg(**{'sales': ('sales', 'mean'),
                           'cost': ('cost', 'mean'),
                           'count': ('store', 'size') })


   store       location     sales    cost   count
0  store_a  North America    7.5   4.500000      2
1  store_b         Europe   20.0  12.666667      3
2  store_c         Europe   10.0   8.000000      1

With your new requests try:

df.groupby(['store','location']).agg(**{col: (col, 'count') if col in (['store']) else (col, 'mean') for col in df.drop(columns=['location']).columns }).rename(columns={'store':'count'})
  • Related