Pandas get nunique using crosstab and custom columns

Time:06-22

I have a dataframe like the one below:

ID,design_id,year,output
1,21345,1978,1
1,3456,2019,1
1,5678,2021,1
1,7890,2021,1
1,5678,2021,2
1,1357,2020,3
2,9876,2021,8
2,9865,2021,1
2,9678,2021,0

I would like to do the following:

a) create 4 year columns: year_2019, year_2020, year_2021 and year_2022.

b) put the count of unique design_id values for each ID under the respective year column

I tried the two approaches below, with the help of this post:

pd.crosstab(
      index=df['ID'], columns=df['year'],
      values=df['design_id'], aggfunc='nunique').fillna(0)

-------------------------------------

out = (pd
 .crosstab(df['ID'], df['year'])
 .reindex(range(2019, 2022 + 1), axis=1, fill_value=0)
 .add_prefix('year_')
 .reset_index()
 .rename_axis(columns=None)
)

My real dataframe has 4 million rows and 50 unique year values (from 1970 to 2022).

But I want the count of unique design_id values (not the count of records) for each ID under each of the year columns 2019, 2020, 2021 and 2022.
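To make the difference between the two counts concrete, here is a minimal sketch using the sample data above. The names record_counts and unique_counts are only illustrative; ID 1 has three rows in 2021 but only two distinct design_id values, because 5678 appears twice.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "design_id": [21345, 3456, 5678, 7890, 5678, 1357, 9876, 9865, 9678],
    "year": [1978, 2019, 2021, 2021, 2021, 2020, 2021, 2021, 2021],
    "output": [1, 1, 1, 1, 2, 3, 8, 1, 0],
})

# A plain crosstab counts rows per (ID, year).
record_counts = pd.crosstab(df["ID"], df["year"])

# groupby + nunique counts distinct design_id values per (ID, year).
unique_counts = df.groupby(["ID", "year"])["design_id"].nunique()

print(record_counts.loc[1, 2021])    # 3 rows for ID 1 in 2021
print(unique_counts.loc[(1, 2021)])  # 2 distinct design_id values
```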

I expect my output to look like this:

ID,year_2019,year_2020,year_2021,year_2022
1,1,1,2,0
2,0,0,3,0

CodePudding user response:

Try:

y = [2019, 2020, 2021, 2022]


print(
    df[df.year.isin(y)]
    .pivot_table(
        index="ID",
        columns="year",
        values="design_id",
        aggfunc="nunique",
        fill_value=0,
    )
    .reindex(y, axis="columns", fill_value=0)
    .add_prefix("year_")
)

Prints:

year  year_2019  year_2020  year_2021  year_2022
ID                                              
1             1          1          2          0
2             0          0          3          0

CodePudding user response:

df.drop(columns=['output'], inplace=True)
df_grouped = df.groupby(['ID', 'year']).agg('nunique')
final_df = df_grouped.unstack(['year']).fillna(0)

gives final_df as

     design_id               
year      1978 2019 2020 2021
ID                           
1          1.0  1.0  1.0  2.0
2          0.0  0.0  0.0  3.0

Does this satisfy your requirements?
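Note that the result above still contains 1978, has no 2022 column, and holds floats. A possible variation of the same groupby/unstack idea that restricts the result to the requested years and keeps integer counts is sketched below. The names years and result are illustrative, and the sample data is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "design_id": [21345, 3456, 5678, 7890, 5678, 1357, 9876, 9865, 9678],
    "year": [1978, 2019, 2021, 2021, 2021, 2020, 2021, 2021, 2021],
    "output": [1, 1, 1, 1, 2, 3, 8, 1, 0],
})

years = [2019, 2020, 2021, 2022]

result = (
    df[df["year"].isin(years)]             # keep only the requested years
    .groupby(["ID", "year"])["design_id"]  # select the column instead of dropping 'output' in place
    .nunique()
    .unstack(fill_value=0)                 # fill_value=0 keeps the counts as integers
    .reindex(columns=years, fill_value=0)  # adds year columns with no data, e.g. 2022
    .add_prefix("year_")
    .rename_axis(columns=None)
    .reset_index()
)
print(result)
```

One caveat of filtering first: an ID with no rows in the requested years would not appear in the result at all.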

CodePudding user response:

Here is a way to get the result you ask for in your question:

y = (
    df[['ID', 'year', 'design_id']].drop_duplicates()
    .groupby(['ID', 'year']).count()
    .unstack(fill_value=0).droplevel(0, axis=1)
    .reindex(columns=[2019, 2020, 2021, 2022], fill_value=0)
    .add_prefix('year_').reset_index()
)
y.columns.name = None

Output:

   ID  year_2019  year_2020  year_2021  year_2022
0   1          1          1          2          0
1   2          0          0          3          0

Explanation:

  • Calling drop_duplicates() on the ID, year and design_id columns, then count() on a groupby() of the (ID, year) key, gives the number of unique design_id values per key
  • Using unstack() creates columns using the years with data for one or more ID values
  • Using droplevel() eliminates the unneeded design_id level of the column multiindex (which resulted from unstack()), after which using reindex() with the desired years 2019, 2020, 2021, 2022 followed by add_prefix() creates the desired column labels year_2019, year_2020, year_2021, year_2022 with the corresponding data (zero-filled where missing, such as for 2022)
  • Using reset_index() moves ID from the index to a new column.
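The steps above can also be run one at a time to inspect the intermediate results. This is only a sketch of the same chain split into pieces; the names deduped, counts, wide, wanted and out are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "design_id": [21345, 3456, 5678, 7890, 5678, 1357, 9876, 9865, 9678],
    "year": [1978, 2019, 2021, 2021, 2021, 2020, 2021, 2021, 2021],
    "output": [1, 1, 1, 1, 2, 3, 8, 1, 0],
})

# Step 1: one row per distinct (ID, year, design_id) combination.
deduped = df[["ID", "year", "design_id"]].drop_duplicates()

# Step 2: counting rows per (ID, year) now counts unique design_id values.
counts = deduped.groupby(["ID", "year"]).count()

# Step 3: years become columns; the column index has two levels (design_id, year).
wide = counts.unstack(fill_value=0)

# Step 4: drop the design_id level, keep only the wanted years, add the prefix.
wanted = [2019, 2020, 2021, 2022]
out = (
    wide.droplevel(0, axis=1)
    .reindex(columns=wanted, fill_value=0)
    .add_prefix("year_")
    .reset_index()
)
out.columns.name = None
print(out)
```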

CodePudding user response:

Based on the solution you indicated, you can change the column names as follows:

df2 = pd.crosstab(
      index=df['ID'], columns=df['year'],
      values=df['design_id'], aggfunc='nunique')
# select the wanted years first, so that each label matches its own data
df2 = df2.reindex(columns=range(2019, 2023)).fillna(0)
df2.columns = ["year_" + str(col) for col in df2.columns]
df2.reset_index()

   ID  year_2019  year_2020  year_2021  year_2022
0   1        1.0        1.0        2.0        0.0
1   2        0.0        0.0        3.0        0.0

OR

df2=pd.crosstab(
      index=df['ID'], columns=df['year'],
      values=df['design_id'], aggfunc='nunique').fillna(0).add_prefix('year_')
df2
year    year_1978   year_2019   year_2020   year_2021
ID              
1         1.0         1.0         1.0         2.0
2         0.0         0.0         0.0         3.0