Pandas get nunique using crosstab and custom columns

I have a dataframe like the one below:

ID,design_id,year,output
1,21345,1978,1
1,3456,2019,1
1,5678,2021,1
1,7890,2021,1
1,5678,2021,2
1,1357,2020,3
2,9876,2021,8
2,9865,2021,1
2,9678,2021,0
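
For anyone who wants to reproduce this, the sample above can be loaded into a DataFrame roughly as follows. This is only a minimal sketch: the name df matches the code later in the thread, while csv_text and reading from an in-memory string are conveniences assumed here, not part of the original question.

```python
import io

import pandas as pd

# The sample rows from the question, kept as CSV text (illustrative name).
csv_text = """ID,design_id,year,output
1,21345,1978,1
1,3456,2019,1
1,5678,2021,1
1,7890,2021,1
1,5678,2021,2
1,1357,2020,3
2,9876,2021,8
2,9865,2021,1
2,9678,2021,0
"""

# Parse the text as if it were a file on disk.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
```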

I would like to do the following:

a) create 4 year columns: year_2019, year_2020, year_2021 and year_2022.

b) put the count of unique design_id values for each ID under the respective year column

I tried the two approaches below, with the help of this post:

pd.crosstab(
      index=df['ID'], columns=df['year'],
      values=df['design_id'], aggfunc='nunique').fillna(0)

-------------------------------------

out = (pd
 .crosstab(df['ID'], df['year'])
 .reindex(range(2019, 2022 + 1), axis=1, fill_value=0)
 .add_prefix('year_')
 .reset_index()
 .rename_axis(columns=None)
)

My real dataframe has 4 million rows and 50 unique year values (from 1970 to 2022).

But I want the count of unique design_id values (not the count of records) for each ID under each of the year columns 2019, 2020, 2021 and 2022.
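
To make the distinction concrete, here is a small sketch on the sample data for ID 1 in 2021, where design_id 5678 appears twice. The names rows_2021, n_records and n_unique are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "design_id": [21345, 3456, 5678, 7890, 5678, 1357, 9876, 9865, 9678],
    "year": [1978, 2019, 2021, 2021, 2021, 2020, 2021, 2021, 2021],
    "output": [1, 1, 1, 1, 2, 3, 8, 1, 0],
})

# All rows for ID 1 in 2021.
rows_2021 = df[(df["ID"] == 1) & (df["year"] == 2021)]

n_records = len(rows_2021)                   # every row counts, duplicates included
n_unique = rows_2021["design_id"].nunique()  # distinct design_id values only

print(n_records, n_unique)
```

A plain pd.crosstab(df['ID'], df['year']) reports the first number for that cell, while the expected output calls for the second.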

I expect my output to look like this:

ID,year_2019,year_2020,year_2021,year_2022
1,1,1,2,0
2,0,0,3,0

CodePudding user response：

Try:

y = [2019, 2020, 2021, 2022]


print(
    df[df.year.isin(y)]
    .pivot_table(
        index="ID",
        columns="year",
        values="design_id",
        aggfunc="nunique",
        fill_value=0,
    )
    .reindex(y, axis="columns", fill_value=0)
    .add_prefix("year_")
)

Prints:

year  year_2019  year_2020  year_2021  year_2022
ID                                              
1             1          1          2          0
2             0          0          3          0

CodePudding user response：

df.drop(columns=['output'], inplace=True)
df_grouped = df.groupby(['ID', 'year']).agg('nunique')
final_df = df_grouped.unstack(['year']).fillna(0)

gives final_df as

     design_id               
year      1978 2019 2020 2021
ID                           
1          1.0  1.0  1.0  2.0
2          0.0  0.0  0.0  3.0

Does this satisfy your requirements?
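
The result above still includes 1978, has no 2022 column, and holds floats. One possible way to adapt the same groupby idea to the requested layout is sketched below, assuming the years of interest are known up front. The names years and wide are illustrative and not from the answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "design_id": [21345, 3456, 5678, 7890, 5678, 1357, 9876, 9865, 9678],
    "year": [1978, 2019, 2021, 2021, 2021, 2020, 2021, 2021, 2021],
    "output": [1, 1, 1, 1, 2, 3, 8, 1, 0],
})

years = [2019, 2020, 2021, 2022]  # assumed target columns

wide = (
    df.groupby(["ID", "year"])["design_id"]
    .nunique()                             # distinct design_id per (ID, year)
    .unstack(fill_value=0)                 # years become columns, gaps become 0
    .reindex(columns=years, fill_value=0)  # drop 1978, add the missing 2022
    .add_prefix("year_")
    .rename_axis(columns=None)             # remove the leftover "year" axis name
    .reset_index()
)
print(wide)
```

Selecting design_id before aggregating avoids having to drop the output column from df in place.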

CodePudding user response：

Here is a way to get the result you ask for in your question:

y = df[['ID','year','design_id']].drop_duplicates().groupby(['ID','year']).count().unstack(
    fill_value=0).droplevel(0, axis=1).reindex(
    fill_value=0, columns=[2019, 2020, 2021, 2022]).add_prefix('year_').reset_index()
y.columns.name = None

Output:

   ID  year_2019  year_2020  year_2021  year_2022
0   1          1          1          2          0
1   2          0          0          3          0

Explanation:

Calling drop_duplicates() on the ID, year and design_id columns, then count() on a groupby() of ID and year, gives the number of unique design_id values per key.
unstack() turns the year level into columns, one for each year that has data for at least one ID.
droplevel() removes the unneeded design_id level of the column MultiIndex that unstack() produces. reindex() with the desired years 2019, 2020, 2021, 2022 then keeps only those columns, zero-filling any that are missing (such as 2022), and add_prefix() creates the labels year_2019, year_2020, year_2021, year_2022.
reset_index() moves ID from the index to a regular column.
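
The first few steps can also be run one at a time to inspect the intermediate shapes. This is just a sketch of the same chain split apart; the names dedup, counts, wide and flat are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "design_id": [21345, 3456, 5678, 7890, 5678, 1357, 9876, 9865, 9678],
    "year": [1978, 2019, 2021, 2021, 2021, 2020, 2021, 2021, 2021],
    "output": [1, 1, 1, 1, 2, 3, 8, 1, 0],
})

# Step 1: one row per distinct (ID, year, design_id) combination.
dedup = df[["ID", "year", "design_id"]].drop_duplicates()

# Step 2: counting rows per (ID, year) now counts unique design_id values.
counts = dedup.groupby(["ID", "year"]).count()

# Step 3: years move into the columns, under a (design_id, year) MultiIndex.
wide = counts.unstack(fill_value=0)

# Step 4: drop the design_id level so only the year labels remain.
flat = wide.droplevel(0, axis=1)

print(flat)
```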

CodePudding user response：

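Based on the crosstab solution you mentioned, you can reindex to the years you want before adding the prefix. Renaming the columns by position would attach the wrong labels, since the crosstab also contains 1978.

df2 = pd.crosstab(
      index=df['ID'], columns=df['year'],
      values=df['design_id'], aggfunc='nunique').fillna(0)
# keep only 2019-2022 (adding missing years as 0), then prefix the labels
df2 = df2.reindex(columns=range(2019, 2023), fill_value=0).add_prefix('year_')
df2.rename_axis(columns=None).reset_index()

   ID  year_2019  year_2020  year_2021  year_2022
0   1        1.0        1.0        2.0        0.0
1   2        0.0        0.0        3.0        0.0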

To keep every year in the data and only add the prefix:

df2=pd.crosstab(
      index=df['ID'], columns=df['year'],
      values=df['design_id'], aggfunc='nunique').fillna(0).add_prefix('year_')
df2

year    year_1978   year_2019   year_2020   year_2021
ID              
1         1.0         1.0         1.0         2.0
2         0.0         0.0         0.0         3.0