I have a dataframe like as below
ID,Region,year,output
1,ANZ,1978,1
1,ANZ,2019,1
1,ANZ,2021,1
1,ASEAN,2021,1
1,ASEAN,2021,2
1,ASEAN,2020,3
2,UK,2021,8
2,UK,2021,1
2,UK,2021,0
I would like to do the below
a) create 4 year columns year_2019,year_2020,year_2021 and year_2022
.
b) put the count of records for each ID
under the respective year
columns
I tried the below
pd.crosstab(
index=tf['ID'], columns=tf['year'],
values=tf['year'], aggfunc='count').fillna(0)
But the problem is it convert all year values as columns in output dataframe.
My real dataframe has 4 million rows and has 50 unique year values.(from 1970 to 2022)
But I want the count of year values only for 2019,2020,2021 and 2022.
I expect my output to be like as below
ID,year_2019,year_2020,year_2021,year_2022
1,1,1,3,0
2,0,0,3,0
CodePudding user response:
You can use:
out = (pd
.crosstab(df['ID'], df['year'])
.reindex(range(2019, 2022 1), axis=1, fill_value=0)
.add_prefix('year_')
.reset_index()
.rename_axis(columns=None)
)
output:
ID year_2019 year_2020 year_2021 year_2022
0 1 1 1 3 0
1 2 0 0 3 0