Problems when using groupby: how to calculate the increase after groupby?

A small sample of my data:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'date': {0: Timestamp('2021-08-01 00:00:00'),
  1: Timestamp('2022-08-01 00:00:00'),
  2: Timestamp('2021-08-01 00:00:00'),
  3: Timestamp('2021-08-01 00:00:00'),
  4: Timestamp('2022-08-01 00:00:00'),
  5: Timestamp('2022-08-01 00:00:00')},
 'customer_nr': {0: 2, 1: 3, 2: 2, 3: 3, 4: 2, 5: 2},
 'product_nr': {0: 3, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1},
 'age': {0: 32.0, 1: 32.0, 2: 32.0, 3: 32.0, 4: 32.0, 5: 37.0},
 'gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M'},
 'age_group': {0: '25-34',
  1: '25-34',
  2: '25-34',
  3: '25-34',
  4: '25-34',
  5: '35-44'}} )

Then I want to re-group like this:

df.groupby(['date','product_nr','age_group']).age.count().unstack()

Which looks like:

	age_group	25-34	35-44
date	product_nr
2021-08-01	1	1	NaN
	2	1	NaN
	3	1	NaN
2022-08-01	1	NaN	1
	2	2	NaN

Next step (?)

Next, I want to calculate by what percentage product sales increased from the first date to the second date, for each age_group. However, I find the DataFrame that groupby produces hard to work with.

Desired result:

product_nr	25-34	35-44	45-54	55-64
1	x%	x%	x%	x%
2	x%	x%	x%	x%
3	x%	x%	x%	x%

PS: the original dataset contains significantly more products and customers. I should also mention that the two years do not contain the same number of product_nr values, and the products do not appear in the same order.

CodePudding user response：

t = df.groupby(['date','product_nr','age_group']).age.count().unstack()

Apply the percentage change within each product_nr group (this code also works if you have more than two dates):

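parts = []
# use a separate loop variable so the original df is not overwritten
for group, sub in t.groupby('product_nr'):
    # percentage change versus the previous date for this product
    parts.append(((sub / sub.shift(1)) - 1) * 100)
output = pd.concat(parts).reset_index()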

output:

age_group       date  product_nr  25-34  35-44
0         2021-08-01           1    NaN    NaN
1         2022-08-01           1    NaN    NaN
2         2021-08-01           2    NaN    NaN
3         2022-08-01           2  100.0    NaN
4         2021-08-01           3    NaN    NaN

Get the output for the date needed:

output[output['date'] == '2022-08-01']

Final output:

age_group       date  product_nr  25-34  35-44
1         2022-08-01           1    NaN    NaN
3         2022-08-01           2  100.0    NaN
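
The same calculation can probably be done without the explicit loop. The following is only a sketch on the sample data from the question: it shifts the counts within each product_nr group and divides. The names previous, pct and result are my own, and it assumes the rows of t are sorted by date within each product, which is the case for the groupby result above.

```python
import pandas as pd

# Sample data from the question (only the columns the calculation needs).
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-08-01', '2022-08-01', '2021-08-01',
                            '2021-08-01', '2022-08-01', '2022-08-01']),
    'product_nr': [3, 2, 2, 1, 2, 1],
    'age': [32.0, 32.0, 32.0, 32.0, 32.0, 37.0],
    'age_group': ['25-34', '25-34', '25-34', '25-34', '25-34', '35-44'],
})

t = df.groupby(['date', 'product_nr', 'age_group']).age.count().unstack()

# Counts from the previous date for the same product, row-aligned with t.
previous = t.groupby(level='product_nr').shift(1)

# Percentage change relative to the previous date (NaN where either side is missing).
pct = (t / previous - 1) * 100

# Keep only the latest date; the result is indexed by product_nr.
result = pct.xs(pd.Timestamp('2022-08-01'), level='date')
print(result)
```

Because xs drops the date level, result already has the shape of the desired table: one row per product_nr and one column per age_group.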

CodePudding user response：

You can save the grouped DataFrame and fill the NaNs in it with the fillna method.

df_group = df.groupby(['date','product_nr','age_group']).age.count().unstack().fillna(0)

Then you can save the 2022 and 2021 data to new variables:

df_2022 = df_group.loc["2022-08-01"]
df_2021 = df_group.loc["2021-08-01"]

Then subtract one from the other and divide by the original (2021) values to get the relative change (multiply by 100 for a percentage):

(df_2022 - df_2021).divide(df_2021)
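
Since the question mentions that the two years do not contain the same products, here is a hedged sketch of how this approach might be made robust. pandas aligns the two frames on product_nr labels, so row order does not matter, but a product missing from one year needs an explicit 0, and a 2021 count of 0 would otherwise give inf or NaN. The names first, second, products and pct are my own, and treating a zero baseline as undefined (NaN) is an assumption about what you want.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the columns the calculation needs).
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-08-01', '2022-08-01', '2021-08-01',
                            '2021-08-01', '2022-08-01', '2022-08-01']),
    'product_nr': [3, 2, 2, 1, 2, 1],
    'age': [32.0, 32.0, 32.0, 32.0, 32.0, 37.0],
    'age_group': ['25-34', '25-34', '25-34', '25-34', '25-34', '35-44'],
})

counts = (df.groupby(['date', 'product_nr', 'age_group']).age.count()
            .unstack().fillna(0))

first = counts.xs(pd.Timestamp('2021-08-01'), level='date')
second = counts.xs(pd.Timestamp('2022-08-01'), level='date')

# Use the union of products so one that is missing in a year counts as 0 sales.
products = first.index.union(second.index)
first = first.reindex(products, fill_value=0)
second = second.reindex(products, fill_value=0)

# Assumption: a baseline of 0 has no meaningful percentage change, so mark it NaN.
pct = (second - first).divide(first.replace(0, np.nan)) * 100
print(pct)
```

On the sample data, product 3 was sold in 2021 but not in 2022, so it shows up as a 100% decrease instead of disappearing from the result.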

CodePudding user response：

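Use GroupBy.apply with Series.pct_change:

# `columns` (the grouping columns) and 'age_count' (the column holding the counts) are placeholders
# group_keys=False keeps the original index so the result can be assigned back to df
df['pct_ch'] = df.groupby(columns, group_keys=False)['age_count'].apply(pd.Series.pct_change) + 1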