Home > Software design >  How to reshape dataframe to get a subtraction of two segments for a given day and bucket?
How to reshape dataframe to get a subtraction of two segments for a given day and bucket?

Time:07-31

I have a dataframe that looks has a breakdown by date-bucket-group (i.e. for each day, we have many buckets and within those buckets we have two groups) and looks like this:

date       | bucket |  Group  |purchase
2020-01-01 | 1      |  A      | 12
2020-01-01 | 1      |  B      | 11

2020-01-01 | 2      |  A      | 14
2020-01-01 | 2      |  B      | 14

2020-02-01 | 1      |  A      | 11
2020-02-01 | 1      |  B      | 10

I would like to create a new dataframe, with a "difference" column that looks like this:

date       | bucket |  purchase | difference
2020-01-01 | 1      |  12-11=1 (Group A - Group B for that day/bucket)
2020-01-01 | 2      |  0
2020-02-01 | 1      |  1

How can I shape my df in such way?

CodePudding user response:

You can reshape your dataframe before compute the diff:

out = (df.set_index(['date', 'bucket', 'group'])['purchase']
         .unstack('group').diff(-1, axis=1)['A']
         .rename('difference').reset_index())
print(out)

# Output
         date  bucket  difference
0  2020-01-01       1           1
1  2020-01-01       2           0
2  2020-02-01       1           1

Or with pivot:

out = (df.pivot(['date', 'bucket'], 'group', 'purchase')
         .diff(-1, axis=1)['A'].rename('difference').reset_index())

CodePudding user response:

When needing to select and align, it's often easier to set the columns as index. Here using xs to select the groups:

s = df.set_index(['date', 'bucket', 'Group'])['purchase'].rename('difference')

(s.xs('A', level='Group')-s.xs('B', level='Group')).reset_index()

Variant using a DataFrame as intermediate, then renaming (useful to handle multiple columns):

df2 = df.set_index(['date', 'bucket', 'Group'])

(df2.xs('A', level='Group')-df2.xs('B', level='Group')
 ).reset_index().rename(columns={'purchase': 'difference'})

Output:

         date  bucket  difference
0  2020-01-01       1           1
1  2020-01-01       2           0
2  2020-02-01       1           1

CodePudding user response:

These methods doesn't rely on the values of the Group column, just that there are two ordered groups.

out = (df.set_index(['date', 'bucket', 'Group'])['purchase']
         .groupby(['date', 'bucket'])
         .diff(-1)
         .dropna()
         .droplevel(-1)
         .reset_index(name='difference'))
print(out)

# OR 

out = (df.groupby(['date', 'bucket'])
         .apply(lambda x: x.groupby('Group')['purchase']
                           .sum()
                           .diff(-1)
                           .dropna()))
out.columns = ['difference']
out = out.reset_index()
print(out)

Output:

          date  bucket  difference
0  2020-01-01        1         1.0
1  2020-01-01        2         0.0
2  2020-02-01        1         1.0

1-liner of mozway's:

(df.set_index(['date', 'bucket', 'Group'])['purchase']
   .agg(lambda x: x.xs('A', level='Group').sub(x.xs('B', level='Group')))
   .reset_index(name='difference'))

Output:

         date  bucket  difference
0  2020-01-01       1           1
1  2020-01-01       2           0
2  2020-02-01       1           1

CodePudding user response:

Use <code>DataFrame.pivot_table</code> with <code>DataFrame.sub</code>:
df1 = df.pivot_table(index=['date','bucket'], columns='purchase', 
values='value')
   .sub(df1['B'], axis=0)
   .rename(columns={'A':'difference'})
   .reset_index()
print (df1)
  purchase       date  bucket  difference
  0 2020-01-01 2020-01-01       1           1
  1 2020-02-01 2020-02-01       1           1
 
  • Related