I have a data frame, let's call it df, with various columns. I am interested in the rows for a particular year, and in counting the values of one column for that year, so I use the following:
df =
ClientI OrderStartDate FunderCode
0 U27 2017-05-22 H2
1 U28 2017-05-22 H2
2 U28 2018-09-27 H3
3 U28 2019-03-19 H4
4 U29 2017-05-22 H2
HCPL2017 = (
    df[df['OrderStartDate'].dt.year == 2017]['FunderCode']
    .value_counts()
    .reset_index()
    .rename(columns={'index': 'FunderCode', 'FunderCode': 'Count_2017'})
)
  FunderCode  Count_2017
0         H2         431
1         H4         188
2         H3          59
3         H1           2
Similarly, I did the same for the years 2018, 2019, etc.
Then I merge using:
pd.merge(pd.merge(HCPL2017, HCPL2018, on="FunderCode"), HCPL2019, on="FunderCode")
In the end, I got a merged table
FunderCode Count_2017 Count_2018 Count_2019
3 H1 2 85 207
0 H2 431 591 724
2 H3 59 205 372
1 H4 188 201 282
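As an aside, the chain of nested pd.merge calls can be folded with functools.reduce, which scales to any number of years. A minimal sketch, using small made-up per-year frames in place of the real HCPL2017/HCPL2018/HCPL2019:

```python
from functools import reduce

import pandas as pd

# Hypothetical per-year count frames standing in for HCPL2017, HCPL2018, HCPL2019.
yearly = [
    pd.DataFrame({"FunderCode": ["H2", "H3"], "Count_2017": [431, 59]}),
    pd.DataFrame({"FunderCode": ["H2", "H3"], "Count_2018": [591, 205]}),
    pd.DataFrame({"FunderCode": ["H2", "H3"], "Count_2019": [724, 372]}),
]

# reduce folds pd.merge pairwise over the list, replacing the nested calls.
merged = reduce(lambda left, right: pd.merge(left, right, on="FunderCode"), yearly)
print(merged)
```

This only tidies the merge step; it still requires building one frame per year first.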
Is there a quicker way to get this information from the original data frame? I have many more years, so I am wondering if I can filter all of them and get the counts in a few steps.
CodePudding user response:
You can use pandas pivot_table:
df["year"] = pd.to_datetime(df["OrderStartDate"]).dt.year
(
    pd.pivot_table(df, values="ClientI", index="FunderCode", columns="year", aggfunc="count")
    .add_prefix("Count_")
    .reset_index()
    .rename_axis(None, axis=1)
    # .fillna(0) if you want to replace the NaNs
)
Output:
  FunderCode  Count_2017  Count_2018  Count_2019
0         H2         3.0         NaN         NaN
1         H3         NaN         1.0         NaN
2         H4         NaN         NaN         1.0
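A variant on the same idea: pd.crosstab counts row/column combinations directly, so the helper "year" column and the aggfunc argument aren't needed, and missing combinations come back as 0 instead of NaN. A sketch using the small sample frame from the question:

```python
import pandas as pd

# Sample data mirroring the frame in the question.
df = pd.DataFrame({
    "ClientI": ["U27", "U28", "U28", "U28", "U29"],
    "OrderStartDate": pd.to_datetime(
        ["2017-05-22", "2017-05-22", "2018-09-27", "2019-03-19", "2017-05-22"]
    ),
    "FunderCode": ["H2", "H2", "H3", "H4", "H2"],
})

# crosstab tallies FunderCode-by-year occurrences in one call.
counts = (
    pd.crosstab(df["FunderCode"], df["OrderStartDate"].dt.year)
    .add_prefix("Count_")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(counts)
```

Whether you prefer the NaNs of pivot_table or the zeros of crosstab depends on how you read the result downstream.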