I have this dataset:
import pandas as pd
d = {'Agreements': ["Rome", "NewYork", "Paris", "Tokyo"], 'Year': [2012, 2012, 2013, 2013],
'Provision1': [1, 1, 1, 1], 'Provision2': [1, 1, 0, 1], 'Provision3': [0, 1, 1, 0]}
df = pd.DataFrame(data=d)
Agreements | Year | Provision1 | Provision2 | Provision3 |
---|---|---|---|---|
Rome | 2012 | 1 | 1 | 0 |
NewYork | 2012 | 1 | 1 | 1 |
Paris | 2013 | 1 | 0 | 1 |
Tokyo | 2013 | 1 | 1 | 0 |
I would like to group by to obtain as output:
Year | Count Agreements per Year | Count Provisions per Year | Average Provision per Year |
---|---|---|---|
2012 | 2 | 5 | 2.5 |
2013 | 2 | 4 | 2 |
I have tried with df.groupby('Year')['Agreements'].count().reset_index(name='counts')
but I do not know how to expand it to obtain the output I desire. Thanks
CodePudding user response:
Try:
out = df.assign(
Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1)
).pivot_table(
index="Year", aggfunc={"Agreements": "count", "Provisions": ("sum", "mean")}
)
out.columns = [
f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
Year | Count Agreements per Year | Average Provisions per Year | Sum Provisions per Year |
---|---|---|---|
2012 | 2 | 2.5 | 5 |
2013 | 2 | 2 | 4 |
EDIT: To add column with the count of "Agreements with at least one Provision":
out = df.assign(
Provisions=df.loc[:, "Provision1":"Provision3"].sum(axis=1),
AggreementsWithAtLeastOneProvision=df.loc[:, "Provision1":"Provision3"].any(axis=1)
).pivot_table(
index="Year", aggfunc={"Agreements": "count", "AggreementsWithAtLeastOneProvision": "sum", "Provisions": ("sum", "mean")}
)
out.columns = [
f'{b.capitalize().replace("Mean", "Average")} {a} per Year' for a, b in out.columns
]
print(out.reset_index().to_markdown(index=False))
Prints:
Year | Sum AggreementsWithAtLeastOneProvision per Year | Count Agreements per Year | Average Provisions per Year | Sum Provisions per Year |
---|---|---|---|---|
2012 | 2 | 3 | 1.66667 | 5 |
2013 | 2 | 2 | 2 | 4 |
Input data in this case was:
Agreements Year Provision1 Provision2 Provision3
0 Bratislava 2012 0 0 0
1 Rome 2012 1 1 0
2 NewYork 2012 1 1 1
3 Paris 2013 1 0 1
4 Tokyo 2013 1 1 0
CodePudding user response:
You can use melt
and agg
:
out = (df.assign(Average=lambda x: x.filter(like='Provision').sum(axis=1))
.melt(id_vars=['Year', 'Agreements', 'Average'], var_name='Provision').groupby('Year')
.agg(**{'Count Agreements per Year': ('Agreements', 'nunique'),
'Count Provisions per Year': ('value', 'sum'),
'Average Provision per Year': ('Average', 'mean')})
.reset_index())
Output:
>>> out
Year Count Agreements per Year Count Provisions per Year Average Provision per Year
0 2012 2 5 2.5
1 2013 2 4 2.0
CodePudding user response:
There are multiple ways to approach this, here is a simple one:
First, combine all "Provision" columns into one column:
# combining all Provision columns into one column
df['All_Provisions'] = df['Provision1'] df['Provision2'] df['Provision3']
Second, aggregate the columns using .agg, which takes your columns and desired aggregations as a dictionary. In our case, we want to:
- count "Agreements" column
- sum "All_Provisions" column
- average "All_Provisions" column
We can do it like:
# aggregating columns
df = df.groupby('Year', as_index=False).agg({'Agreements':'count', 'All_Provisions':['sum', 'mean']})
Finally, rename your columns:
# renaming columns
df.columns = ['Year','Count Agreements per Year','Count Provisions per Year','Average Provision per Year']
Hope this helps :)
CodePudding user response:
Another possible solution:
(df.iloc[:,1:].set_index('Year').stack().to_frame()
.pivot_table(index='Year', values=0, aggfunc=[lambda x:
x.count() / (df.shape[1]-2), 'sum', lambda x: x.sum()/df['Year'].nunique()])
.set_axis(['Count Agreements per Year', 'Count Provisions per Year',
'Average Provision per Year'], axis=1).reset_index())
Output:
Year Count Agreements per Year Count Provisions per Year \
0 2012 2.0 5
1 2013 2.0 4
Average Provision per Year
0 2.5
1 2.0