I have a dataframe like as below
ID,design_id,year,category
1,21345,1978,DC
1,3456,2019,DC
1,5678,2021,DF
1,7890,2021,DC
1,5678,2021,OT
1,1357,2020,np.nan
2,9876,2021,DC
2,9865,2021,DC
2,9876,2021,DC
I would like to do the below
a) create 3 year columns (2020,2021,2022) for variables design_id
, category == DC
and category == DF
.
b) put the count of unique design_id
for each ID
under the respective year
columns
c) put the count of category == DC
and category == DF
for each ID and design_id
combo under the respective year
columns
I tried the below but this only works for requirement a) above.
pd.crosstab(
index=tf['ID'], columns=tf['year'],
values=tf['design_id'], aggfunc='nunique').fillna(0)
My real dataframe has 4 million rows and has 50 unique year values.(from 1970 to 2022)
I expect my output to be like as below
ID,design_cnt2020,design_cnt2021,design_cnt2022,DC_cnt_2020,DC_cnt_2021,DC_cnt_2022,DF_cnt_2020,DF_cnt_2021,Df_cnt_2022
1,1,2,0,0,1,0,0,1,0
2,0,2,0,0,2,0,0,0,0
CodePudding user response:
Use:
#define range of years
r = range(2020, 2023)
#dynamic count years
#y = pd.Timestamp('now').year
#r = list(range(y-2, y 1))
#because huge df filter expected years
tf = tf[tf['year'].isin(r)]
#processing unique counts per year and add missing years
df1 = (pd.crosstab(
index=tf['ID'], columns=tf['year'],
values=tf['design_id'], aggfunc='nunique')
.fillna(0)
.astype(int)
.reindex(r, axis=1, fill_value=0)
.add_prefix('design_cnt'))
#processing unique counts per year and category
df2 = pd.crosstab(
index=tf['ID'], columns=[tf['year'], tf['category']],
values=tf['design_id'], aggfunc='nunique').fillna(0).astype(int)
#add missing combinations year, category
mux = pd.MultiIndex.from_product([r, tf['category'].unique()])
df2 = df2.reindex(mux, axis=1, fill_value=0).sort_index(level=[1,0], axis=1)
#flatten MultiIndex
df2.columns = df2.columns.map(lambda x: f'{x[1]}_cnt_{x[0]}')
#join both DataFrames
df = df1.join(df2)
print (df)
design_cnt2020 design_cnt2021 design_cnt2022 DC_cnt_2020 DC_cnt_2021 \
ID
1 1 2 0 0 1
2 0 2 0 0 2
DC_cnt_2022 DF_cnt_2020 DF_cnt_2021 DF_cnt_2022
ID
1 0 0 1 0
2 0 0 0 0