I have a dataframe that looks has a breakdown by date-bucket-group (i.e. for each day, we have many buckets and within those buckets we have two groups) and looks like this:
date | bucket | Group |purchase
2020-01-01 | 1 | A | 12
2020-01-01 | 1 | B | 11
2020-01-01 | 2 | A | 14
2020-01-01 | 2 | B | 14
2020-02-01 | 1 | A | 11
2020-02-01 | 1 | B | 10
I would like to create a new dataframe, with a "difference" column that looks like this:
date | bucket | purchase | difference
2020-01-01 | 1 | 12-11=1 (Group A - Group B for that day/bucket)
2020-01-01 | 2 | 0
2020-02-01 | 1 | 1
How can I shape my df in such way?
CodePudding user response:
You can reshape your dataframe before compute the diff:
out = (df.set_index(['date', 'bucket', 'group'])['purchase']
.unstack('group').diff(-1, axis=1)['A']
.rename('difference').reset_index())
print(out)
# Output
date bucket difference
0 2020-01-01 1 1
1 2020-01-01 2 0
2 2020-02-01 1 1
Or with pivot
:
out = (df.pivot(['date', 'bucket'], 'group', 'purchase')
.diff(-1, axis=1)['A'].rename('difference').reset_index())
CodePudding user response:
When needing to select and align, it's often easier to set the columns as index. Here using xs
to select the groups:
s = df.set_index(['date', 'bucket', 'Group'])['purchase'].rename('difference')
(s.xs('A', level='Group')-s.xs('B', level='Group')).reset_index()
Variant using a DataFrame as intermediate, then renaming (useful to handle multiple columns):
df2 = df.set_index(['date', 'bucket', 'Group'])
(df2.xs('A', level='Group')-df2.xs('B', level='Group')
).reset_index().rename(columns={'purchase': 'difference'})
Output:
date bucket difference
0 2020-01-01 1 1
1 2020-01-01 2 0
2 2020-02-01 1 1
CodePudding user response:
These methods doesn't rely on the values of the Group column, just that there are two ordered groups.
out = (df.set_index(['date', 'bucket', 'Group'])['purchase']
.groupby(['date', 'bucket'])
.diff(-1)
.dropna()
.droplevel(-1)
.reset_index(name='difference'))
print(out)
# OR
out = (df.groupby(['date', 'bucket'])
.apply(lambda x: x.groupby('Group')['purchase']
.sum()
.diff(-1)
.dropna()))
out.columns = ['difference']
out = out.reset_index()
print(out)
Output:
date bucket difference
0 2020-01-01 1 1.0
1 2020-01-01 2 0.0
2 2020-02-01 1 1.0
1-liner of mozway's:
(df.set_index(['date', 'bucket', 'Group'])['purchase']
.agg(lambda x: x.xs('A', level='Group').sub(x.xs('B', level='Group')))
.reset_index(name='difference'))
Output:
date bucket difference
0 2020-01-01 1 1
1 2020-01-01 2 0
2 2020-02-01 1 1
CodePudding user response:
Use <code>DataFrame.pivot_table</code> with <code>DataFrame.sub</code>:
df1 = df.pivot_table(index=['date','bucket'], columns='purchase',
values='value')
.sub(df1['B'], axis=0)
.rename(columns={'A':'difference'})
.reset_index()
print (df1)
purchase date bucket difference
0 2020-01-01 2020-01-01 1 1
1 2020-02-01 2020-02-01 1 1