A small sample of my data:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'date': {0: Timestamp('2021-08-01 00:00:00'),
1: Timestamp('2022-08-01 00:00:00'),
2: Timestamp('2021-08-01 00:00:00'),
3: Timestamp('2021-08-01 00:00:00'),
4: Timestamp('2022-08-01 00:00:00'),
5: Timestamp('2022-08-01 00:00:00')},
'customer_nr': {0: 2, 1: 3, 2: 2, 3: 3, 4: 2, 5: 2},
'product_nr': {0: 3, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1},
'age': {0: 32.0, 1: 32.0, 2: 32.0, 3: 32.0, 4: 32.0, 5: 37.0},
'gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M'},
'age_group': {0: '25-34',
1: '25-34',
2: '25-34',
3: '25-34',
4: '25-34',
5: '35-44'}} )
Then I want to re-group like this:
df.groupby(['date','product_nr','age_group']).age.count().unstack()
Which looks like:
date | product_nr | 25-34 | 35-44 |
---|---|---|---|
2021-08-01 | 1 | 1 | NaN |
2021-08-01 | 2 | 1 | NaN |
2021-08-01 | 3 | 1 | NaN |
2022-08-01 | 1 | NaN | 1 |
2022-08-01 | 2 | 2 | NaN |
Next step (?)
The next thing I want to do is calculate the percentage by which sales of each product increased from the first date to the second date, for each age_group. However, after grouping like this I find the resulting dataframe hard to work with.
Desired result:
product_nr | 25-34 | 35-44 | 45-54 | 55-64 |
---|---|---|---|---|
1 | x% | x% | x% | x% |
2 | x% | x% | x% | x% |
3 | x% | x% | x% | x% |
PS: the original dataset contains significantly more products and customers. I should also mention that the two years do not contain the same number of product_nr values, and the products are not arranged in the same order.
CodePudding user response:
t = df.groupby(['date','product_nr','age_group']).age.count().unstack()
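Apply the percentage change within each product_nr group (this code also works if you have more than two dates):
# Collect one percentage-change frame per product, then concatenate once.
pieces = []
for group, sub in t.groupby('product_nr'):
    # Change relative to the previous date for the same product, in percent.
    pieces.append(((sub / sub.shift(1)) - 1) * 100)
output = pd.concat(pieces)
output.reset_index(inplace=True)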
output:
age_group date product_nr 25-34 35-44
0 2021-08-01 1 NaN NaN
1 2022-08-01 1 NaN NaN
2 2021-08-01 2 NaN NaN
3 2022-08-01 2 100.0 NaN
4 2021-08-01 3 NaN NaN
Get the output for the date needed:
output[output['date'] == '2022-08-01']
Final output:
age_group date product_nr 25-34 35-44
1 2022-08-01 1 NaN NaN
3 2022-08-01 2 100.0 NaN
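The loop above can also be written without explicit iteration. The following is a minimal sketch, assuming the sample data from the question (rebuilt here with only the columns that matter) and using `groupby(...).shift(1)` to fetch the previous date's counts for the same product. The names `prev`, `pct` and `result` are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-01", "2022-08-01", "2021-08-01",
                            "2021-08-01", "2022-08-01", "2022-08-01"]),
    "product_nr": [3, 2, 2, 1, 2, 1],
    "age": [32.0, 32.0, 32.0, 32.0, 32.0, 37.0],
    "age_group": ["25-34", "25-34", "25-34", "25-34", "25-34", "35-44"],
})

t = df.groupby(["date", "product_nr", "age_group"]).age.count().unstack()

# Order rows by product first, then by date, so each product's dates are adjacent.
t = t.sort_index(level=["product_nr", "date"])

# Counts at the previous date for the same product (NaN for a product's first date).
prev = t.groupby(level="product_nr").shift(1)

# Percentage change relative to the previous date.
pct = ((t / prev) - 1) * 100
result = pct.reset_index()
print(result)
```

As in the loop version, a cell is NaN whenever either of the two dates has no sales for that product and age group.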
CodePudding user response:
You can save the new grouped DataFrame and fill the NaN values in it with the fillna method.
df_group = df.groupby(['date','product_nr','age_group']).age.count().unstack().fillna(0)
Then you can save 2022 and 2021 data to new variables
df_2022 = df_group.loc["2022-08-01"]
df_2021 = df_group.loc["2021-08-01"]
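Then subtract one from the other and divide by the original to get the relative change (multiply by 100 for a percentage). Note that cells where the 2021 count is 0 will come out as inf or NaN.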
(df_2022 - df_2021).divide(df_2021)
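Since the question mentions that the two years do not contain the same products, here is a hedged sketch of the same subtract-and-divide idea that first aligns both years on the union of products. It assumes that a missing product or age group means zero sales (an assumption, not something the question confirms), hardcodes the two dates from the sample, and turns divisions by zero into NaN. The names `first`, `second` and `pct` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-01", "2022-08-01", "2021-08-01",
                            "2021-08-01", "2022-08-01", "2022-08-01"]),
    "product_nr": [3, 2, 2, 1, 2, 1],
    "age_group": ["25-34", "25-34", "25-34", "25-34", "25-34", "35-44"],
})

# Sales counts per (date, product), one column per age group; missing cells become 0.
counts = (
    df.groupby(["date", "product_nr", "age_group"]).size().unstack(fill_value=0)
)

# One frame per date, indexed by product_nr only.
first = counts.xs(pd.Timestamp("2021-08-01"), level="date")
second = counts.xs(pd.Timestamp("2022-08-01"), level="date")

# Align both years on the union of products (assumption: absent means 0 sales).
products = first.index.union(second.index)
first = first.reindex(products, fill_value=0)
second = second.reindex(products, fill_value=0)

# Percentage change; a zero baseline gives inf, which is replaced by NaN.
pct = (second - first) / first * 100
pct = pct.replace([np.inf, -np.inf], np.nan)
print(pct)
```

The result has product_nr as rows and age groups as columns, which is the shape asked for in the question. With zero filling, a product sold in 2021 but not in 2022 shows as -100 rather than NaN.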
CodePudding user response:
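Use GroupBy.apply with Series.pct_change (here columns stands for your grouping keys and 'age' for the column being compared):
df['pct_ch'] = df.groupby(columns, group_keys=False)['age'].apply(pd.Series.pct_change) + 1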
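To make the pct_change idea concrete, here is a minimal sketch on the question's sample data. It is not the exact call from this answer: it uses `GroupBy.pct_change` directly instead of `apply`, works on long-form counts rather than on the raw rows, and hardcodes the second date. The names `counts`, `pct` and `latest` are illustrative.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-01", "2022-08-01", "2021-08-01",
                            "2021-08-01", "2022-08-01", "2022-08-01"]),
    "product_nr": [3, 2, 2, 1, 2, 1],
    "age_group": ["25-34", "25-34", "25-34", "25-34", "25-34", "35-44"],
})

# Long-form counts, sorted so each (product, age group) runs in date order.
counts = df.groupby(["product_nr", "age_group", "date"]).size().sort_index()

# Percentage change from the previous date within each (product, age group).
pct = counts.groupby(level=["product_nr", "age_group"]).pct_change() * 100

# Keep only the second date and move the age groups into columns.
latest = pct.xs(pd.Timestamp("2022-08-01"), level="date").unstack("age_group")
print(latest)
```

Combinations that appear on only one date have no previous value to compare against, so they come out as NaN, and products with no sales on the second date do not appear in `latest` at all.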