I have a dataframe like the one below
ID,design_id,year,output
1,21345,1978,1
1,3456,2019,1
1,5678,2021,1
1,7890,2021,1
1,5678,2021,2
1,1357,2020,3
2,9876,2021,8
2,9865,2021,1
2,9678,2021,0
I would like to do the below:
a) create 4 year columns: year_2019, year_2020, year_2021 and year_2022.
b) put the count of unique design_id for each ID under the respective year columns.
I tried the below 2 approaches with the help of this post
pd.crosstab(
index=tf['ID'], columns=tf['year'],
values=tf['design_id'], aggfunc='nunique').fillna(0)
-------------------------------------
out = (pd
.crosstab(df['ID'], df['year'])
.reindex(range(2019, 2022 + 1), axis=1, fill_value=0)
.add_prefix('year_')
.reset_index()
.rename_axis(columns=None)
)
My real dataframe has 4 million rows and 50 unique year values (from 1970 to 2022).
But I want the count of unique design_id (and not the count of records) for each ID under each of the year columns 2019, 2020, 2021 and 2022.
I expect my output to be as below
ID,year_2019,year_2020,year_2021,year_2022
1,1,1,2,0
2,0,0,3,0
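For reference, the sample above can be loaded into a dataframe like this (a minimal sketch using the CSV text from the question):

import io
import pandas as pd

csv_text = """ID,design_id,year,output
1,21345,1978,1
1,3456,2019,1
1,5678,2021,1
1,7890,2021,1
1,5678,2021,2
1,1357,2020,3
2,9876,2021,8
2,9865,2021,1
2,9678,2021,0"""

df = pd.read_csv(io.StringIO(csv_text))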
CodePudding user response:
Try:
y = [2019, 2020, 2021, 2022]
print(
    df[df.year.isin(y)]                            # keep only the requested years
    .pivot_table(
        index="ID",
        columns="year",
        values="design_id",
        aggfunc="nunique",                         # count unique design_id, not rows
        fill_value=0,
    )
    .reindex(y, axis="columns", fill_value=0)      # make sure all 4 year columns exist
    .add_prefix("year_")
)
Prints:
year year_2019 year_2020 year_2021 year_2022
ID
1 1 1 2 0
2 0 0 3 0
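If you also want ID back as a regular column and no "year" label over the header, as in the expected output, you could tack on rename_axis/reset_index (a sketch reusing the same pivot as above; the variable name out is just for illustration):

out = (
    df[df.year.isin(y)]
    .pivot_table(index="ID", columns="year", values="design_id",
                 aggfunc="nunique", fill_value=0)
    .reindex(y, axis="columns", fill_value=0)
    .add_prefix("year_")
    .rename_axis(columns=None)   # drop the "year" label above the columns
    .reset_index()               # turn ID back into a regular column
)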
CodePudding user response:
df.drop(columns=['output'], inplace=True)                # output is not needed for the counts
df_grouped = df.groupby(['ID', 'year']).agg('nunique')   # unique design_id per (ID, year)
final_df = df_grouped.unstack(['year']).fillna(0)        # one column per year
gives final_df as:
design_id
year 1978 2019 2020 2021
ID
1 1.0 1.0 1.0 2.0
2 0.0 0.0 0.0 3.0
Does this satisfy your requirements?
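If you need exactly the year_2019 through year_2022 columns (dropping 1978 and adding 2022 as zeros), a follow-up sketch on final_df could be:

wanted = [2019, 2020, 2021, 2022]
final_df = (final_df['design_id']                    # drop the design_id level of the column MultiIndex
            .reindex(columns=wanted, fill_value=0)   # keep only 2019-2022, 2022 filled with 0
            .add_prefix('year_'))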
CodePudding user response:
Here is a way to get the result you ask for in your question:
y = (df[['ID', 'year', 'design_id']].drop_duplicates()
     .groupby(['ID', 'year']).count()
     .unstack(fill_value=0).droplevel(0, axis=1)
     .reindex(fill_value=0, columns=[2019, 2020, 2021, 2022])
     .add_prefix('year_').reset_index())
y.columns.name = None
Output:
ID year_2019 year_2020 year_2021 year_2022
0 1 1 1 2 0
1 2 0 0 3 0
Explanation:
- Using drop_duplicates() on the ID, year key together with design_id, then count() on groupby() of the key, gets the desired counts
- Using unstack() creates columns using the years with data for one or more ID values
- Using droplevel() eliminates the unneeded design_id level of the column multiindex (which resulted from unstack()), after which using reindex() with the desired years 2019, 2020, 2021, 2022 followed by add_prefix() creates the desired column labels year_2019, year_2020, year_2021, year_2022 with the corresponding data (zero-filled where missing, such as for 2022)
- Using reset_index() moves ID from the index to a new column.
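For comparison, the same counts can be reached in one chain with nunique instead of drop_duplicates() plus count() (a sketch, equivalent on this data):

y = (df.groupby(['ID', 'year'])['design_id'].nunique()   # unique designs per (ID, year)
       .unstack(fill_value=0)
       .reindex(columns=[2019, 2020, 2021, 2022], fill_value=0)
       .add_prefix('year_')
       .reset_index())
y.columns.name = None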
CodePudding user response:
Based on the solution you indicated, you can change the column names as follows:
df2=pd.crosstab(
index=df['ID'], columns=df['year'],
values=df['design_id'], aggfunc='nunique').fillna(0)
df2.columns = ["year_" str(col) for col in np.arange(2019,2023)]
df2.reset_index()
ID year_2019 year_2020 year_2021 year_2022
0 1 1.0 1.0 1.0 2.0
1 2 0.0 0.0 0.0 3.0
OR
df2=pd.crosstab(
index=df['ID'], columns=df['year'],
values=df['design_id'], aggfunc='nunique').fillna(0).add_prefix('year_')
df2
year year_1978 year_2019 year_2020 year_2021
ID
1 1.0 1.0 1.0 2.0
2 0.0 0.0 0.0 3.0
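Note that crosstab produces a column for every year that occurs in the data (1978 appears above), so if you want exactly year_2019 through year_2022 you could reindex before adding the prefix (a sketch building on the snippets above):

df2 = (pd.crosstab(index=df['ID'], columns=df['year'],
                   values=df['design_id'], aggfunc='nunique')
         .fillna(0)
         .reindex(columns=[2019, 2020, 2021, 2022], fill_value=0)
         .add_prefix('year_')
         .reset_index())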