Home > front end >  Calculating a difference for groups within dataframe
Calculating a difference for groups within dataframe

Time:05-15

I have a dataframe structured like the example, df, below. This contains 2 variables, time and state. Since these are repeated observations for identity, I wanted to get a vector of differences between times 1 and 2, for each level of state. I did this with the code below, but I feel like this should be an easier task than what I have done.

Is there some way to make this more efficient and reliable? For example if I have many levels for state, I would need to manually do their difference columns after unstacking (as I have already done twice) and this seems a bit inflexible. Also, running a sum over axis=1 feels very hacky. I wonder if it can be improved and would be great to get some feedback on it. I also don't need the multiple column levels which always confuse me.

Here is my way:

import numpy as np
import pandas as pd

ids = [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5]
times = [1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2]
states = ['A', 'B', 'A', 'B'] * 5

np.random.seed(121)
resps = [(i*t)   np.random.normal() for i, t in zip(ids, times)]

DATA = {
    'identity': ids,
    'time': times,
    'state': states,
    'resps': resps
}
df = pd.DataFrame(DATA)
df.head()

Gives us:

   identity  time state     resps
0         1     1     A  0.787967
1         1     1     B  0.715071
2         1     2     A  1.426102
3         1     2     B  1.559690
4         2     1     A  1.669889

Then making the differences in a complicated way:

wide = df.set_index(['identity', 'state', 'time']).unstack().reset_index()

A = wide['state']=='A'
B = wide['state']=='B'

wide['diffA'] = wide[A][('resps', 2)] - wide[A][('resps', 1)]
wide['diffB'] = wide[B][('resps', 2)] - wide[B][('resps', 1)]

wide['difference'] = wide[['diffA', 'diffB']].sum(axis=1)
wide = wide.drop(columns=[('diffA', ''), ('diffB', '')])
print(wide)

     identity state     resps            difference
time                        1          2           
0           1     A  0.787967   1.426102   0.638135
1           1     B  0.715071   1.559690   0.844619
2           2     A  1.669889   5.615373   3.945483
3           2     B  3.183695   4.367062   1.183368
4           3     A  2.985881   7.709641   4.723760
5           3     B  3.629642   4.673013   1.043371
6           4     A  4.401873   9.403826   5.001953
7           4     B  3.808573   6.031231   2.222658
8           5     A  4.209585  10.087744   5.878159
9           5     B  4.267278   9.499714   5.232436

CodePudding user response:

If you are mostly interested in just the difference column, below might help:

In [435]: df['difference'] = df.groupby(['identity', 'state'])['resps'].diff()

In [438]: df.dropna()[['identity', 'state', 'difference']]
Out[438]: 
    identity state  difference
2          1     A    0.638135
3          1     B    0.844619
6          2     A    3.945483
7          2     B    1.183368
10         3     A    4.723760
11         3     B    1.043371
14         4     A    5.001953
15         4     B    2.222658
18         5     A    5.878159
19         5     B    5.232436
  • Related