Pandas find value count cross tab for past 3 years-CodePudding

I have a dataframe like as below

ID,design_id,year,category
1,21345,1978,DC
1,3456,2019,DC
1,5678,2021,DF
1,7890,2021,DC
1,5678,2021,OT
1,1357,2020,np.nan
2,9876,2021,DC
2,9865,2021,DC
2,9876,2021,DC

I would like to do the below

a) create 3 year columns (2020,2021,2022) for variables design_id, category == DC and category == DF.

b) put the count of unique design_id for each ID under the respective year columns

c) put the count of category == DC and category == DF for each ID and design_id combo under the respective year columns

I tried the below but this only works for requirement a) above.

pd.crosstab(
      index=tf['ID'], columns=tf['year'],
      values=tf['design_id'], aggfunc='nunique').fillna(0)

My real dataframe has 4 million rows and has 50 unique year values.(from 1970 to 2022)

I expect my output to be like as below

ID,design_cnt2020,design_cnt2021,design_cnt2022,DC_cnt_2020,DC_cnt_2021,DC_cnt_2022,DF_cnt_2020,DF_cnt_2021,Df_cnt_2022
1,1,2,0,0,1,0,0,1,0
2,0,2,0,0,2,0,0,0,0

CodePudding user response：

Use:

#define range of years
r = range(2020, 2023)

#dynamic count years
#y = pd.Timestamp('now').year
#r = list(range(y-2, y 1))

#because huge df filter expected years
tf = tf[tf['year'].isin(r)]

#processing unique counts per year and add missing years
df1 = (pd.crosstab(
      index=tf['ID'], columns=tf['year'],
      values=tf['design_id'], aggfunc='nunique')
       .fillna(0)
       .astype(int)
       .reindex(r, axis=1, fill_value=0)
       .add_prefix('design_cnt'))

#processing unique counts per year and category
df2 = pd.crosstab(
      index=tf['ID'], columns=[tf['year'], tf['category']],
      values=tf['design_id'], aggfunc='nunique').fillna(0).astype(int)

#add missing combinations year, category
mux = pd.MultiIndex.from_product([r, tf['category'].unique()])
df2 = df2.reindex(mux, axis=1, fill_value=0).sort_index(level=[1,0], axis=1)
#flatten MultiIndex
df2.columns = df2.columns.map(lambda x: f'{x[1]}_cnt_{x[0]}')

#join both DataFrames
df = df1.join(df2)
print (df)


    design_cnt2020  design_cnt2021  design_cnt2022  DC_cnt_2020  DC_cnt_2021  \
ID                                                                             
1                1               2               0            0            1   
2                0               2               0            0            2   

    DC_cnt_2022  DF_cnt_2020  DF_cnt_2021  DF_cnt_2022  
ID                                                      
1             0            0            1            0  
2             0            0            0            0