Home > front end >  Nested Group by with count and average
Nested Group by with count and average

Time:02-01

I have this dataset:

import pandas as pd

d = {'Agreements': ["Rome", "NewYork", "Paris", "Tokyo"], 'Year': [2012, 2012, 2013, 2013], 
     'Provision1': [1, 1, 1, 1], 'Provision2': [1, 1, 0, 1], 'Provision3': [0, 1, 1, 0]}

df = pd.DataFrame(data=d)
Agreements Year Provision1 Provision2 Provision3
Rome 2012 1 1 0
NewYork 2012 1 1 1
Paris 2013 1 0 1
Tokyo 2013 1 1 0

I would like to group by to obtain as output:

Year Count Agreements per Year Count Provisions per Year Average Provision per Year
2012 2 5 2.5
2013 2 4 2

I have tried with df.groupby('Year')['Agreements'].count().reset_index(name='counts') but I do not know how to expand it to obtain the output I desire. Thanks

CodePudding user response:

Try:

out = df.assign(
    Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1)
).pivot_table(
    index="Year", aggfunc={"Agreements": "count", "Provisions": ("sum", "mean")}
)
out.columns = [
    f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))

Prints:

Year Count Agreements per Year Average Provisions per Year Sum Provisions per Year
2012 2 2.5 5
2013 2 2 4

EDIT: To add column with the count of "Agreements with at least one Provision":

out = df.assign(
    Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1),
    AggreementsWithAtLeastOneProvision=df.loc[:, "Provision1":"Provision3"].any(axis=1)
).pivot_table(
    index="Year", aggfunc={"Agreements": "count", "AggreementsWithAtLeastOneProvision": "sum", "Provisions": ("sum", "mean")}
)
out.columns = [
    f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))

Prints:

Year Sum AggreementsWithAtLeastOneProvision per Year Count Agreements per Year Average Provisions per Year Sum Provisions per Year
2012 2 3 1.66667 5
2013 2 2 2 4

Input data in this case was:

   Agreements  Year  Provision1  Provision2  Provision3
0  Bratislava  2012           0           0           0
1        Rome  2012           1           1           0
2     NewYork  2012           1           1           1
3       Paris  2013           1           0           1
4       Tokyo  2013           1           1           0

CodePudding user response:

You can use melt and agg:

out = (df.assign(Average=lambda x: x.filter(like='Provision').sum(axis=1))
         .melt(id_vars=['Year', 'Agreements', 'Average'], var_name='Provision').groupby('Year')
         .agg(**{'Count Agreements per Year': ('Agreements', 'nunique'),
                 'Count Provisions per Year': ('value', 'sum'),
                 'Average Provision per Year': ('Average', 'mean')})
         .reset_index())

Output:

>>> out
   Year  Count Agreements per Year  Count Provisions per Year  Average Provision per Year
0  2012                          2                          5                         2.5
1  2013                          2                          4                         2.0

CodePudding user response:

There are multiple ways to approach this, here is a simple one:

First, combine all "Provision" columns into one column:

# combining all Provision columns into one column
df['All_Provisions'] = df['Provision1']   df['Provision2']   df['Provision3']

Second, aggregate the columns using .agg, which takes your columns and desired aggregations as a dictionary. In our case, we want to:

  1. count "Agreements" column
  2. sum "All_Provisions" column
  3. average "All_Provisions" column

We can do it like:

# aggregating columns
df = df.groupby('Year', as_index=False).agg({'Agreements':'count', 'All_Provisions':['sum', 'mean']})

Finally, rename your columns:

# renaming columns
df.columns = ['Year','Count Agreements per Year','Count Provisions per Year','Average Provision per Year']

Hope this helps :)

CodePudding user response:

Another possible solution:

(df.iloc[:,1:].set_index('Year').stack().to_frame()
 .pivot_table(index='Year', values=0, aggfunc=[lambda x: 
     x.count() / (df.shape[1]-2), 'sum', lambda x: x.sum()/df['Year'].nunique()])
 .set_axis(['Count Agreements per Year', 'Count Provisions per Year', 
                  'Average Provision per Year'], axis=1).reset_index())

Output:

   Year  Count Agreements per Year  Count Provisions per Year  \
0  2012                        2.0                          5   
1  2013                        2.0                          4   

   Average Provision per Year  
0                         2.5  
1                         2.0  
  • Related