I have a data frame like this:
| week | co_week | Revenue | Country |
|---|---|---|---|
| 19/09/2021 | 01/10/2021 | 120 | US |
| 19/09/2021 | 03/10/2021 | 120 | US |
| 19/09/2021 | 03/10/2021 | 120 | FR |
| 20/09/2021 | 05/10/2021 | 15 | US |
| 20/09/2021 | 06/10/2021 | 34 | FR |
Now I'd like to get the sum of Revenue per country, counting each unique week date only once, so the output should look like this:
| week | co_week | Revenue | Total_revenue | Country |
|---|---|---|---|---|
| 19/09/2021 | 01/10/2021 | 120 | 135 | US |
| 19/09/2021 | 03/10/2021 | 120 | 135 | US |
| 19/09/2021 | 03/10/2021 | 120 | 154 | FR |
| 20/09/2021 | 05/10/2021 | 15 | 135 | US |
| 20/09/2021 | 06/10/2021 | 34 | 154 | FR |
I tried the following code, but I ended up with a lot of null values in the new column, because `drop_duplicates` drops most of the indexes!
df['Total_revenue'] = df.drop_duplicates(['Revenue']).groupby(df['Country'])['Total_spend'].transform('sum')
CodePudding user response:
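First, a minimal sketch that rebuilds the sample frame from the question (dates kept as plain strings), so the snippets below can be run as-is:

import pandas as pd

# Sample data copied from the question; dates are left as strings.
df = pd.DataFrame({
    'week':    ['19/09/2021', '19/09/2021', '19/09/2021', '20/09/2021', '20/09/2021'],
    'co_week': ['01/10/2021', '03/10/2021', '03/10/2021', '05/10/2021', '06/10/2021'],
    'Revenue': [120, 120, 120, 15, 34],
    'Country': ['US', 'US', 'FR', 'US', 'FR'],
})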
IIUC, you might want to `drop_duplicates` in the `groupby`:
# Within each country, drop duplicate Revenue values before summing.
df['Total_revenue'] = (df.groupby('Country')['Revenue']
                         .transform(lambda x: x.drop_duplicates().sum())
                       )
But this might be biased if the same country happens to have the same revenue in several different weeks.
Better to use `mask` to hide the duplicated values, then `groupby.transform('sum')`:
# Hide Revenue on rows that repeat a (week, Country) pair, then sum per country.
df['Total_revenue'] = (df['Revenue']
                         .mask(df.duplicated(['week', 'Country']))
                         .groupby(df['Country'])
                         .transform('sum')
                       )
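This works because `mask` turns Revenue into NaN on the duplicated rows, and `transform('sum')` skips NaN by default, so each (week, Country) pair is counted exactly once while every row still receives its group's total.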
output:
week co_week Revenue Country Total_revenue
0 19/09/2021 01/10/2021 120 US 135
1 19/09/2021 03/10/2021 120 US 135
2 19/09/2021 03/10/2021 120 FR 154
3 20/09/2021 05/10/2021 15 US 135
4 20/09/2021 06/10/2021 34 FR 154
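For completeness, an equivalent alternative (a sketch, not part of the original answer) is to deduplicate the (week, Country) pairs first, sum per country, and map the totals back onto the frame:

# Deduplicate (week, Country) pairs, sum Revenue per country,
# then map each row's country to its total.
totals = (df.drop_duplicates(['week', 'Country'])
            .groupby('Country')['Revenue']
            .sum())
df['Total_revenue'] = df['Country'].map(totals)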