I have a dataframe structured like the example, df
, below. This contains 2 variables, time
and state
. Since these are repeated observations for identity
, I wanted to get a vector of differences between time
s 1 and 2, for each level of state
. I did this with the code below, but I feel like this should be an easier task than what I have done.
Is there some way to make this more efficient and reliable? For example if I have many levels for state
, I would need to manually do their difference columns after unstacking (as I have already done twice) and this seems a bit inflexible. Also, running a sum
over axis=1
feels very hacky. I wonder if it can be improved and would be great to get some feedback on it. I also don't need the multiple column levels which always confuse me.
Here is my way:
import numpy as np
import pandas as pd
ids = [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5]
times = [1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2]
states = ['A', 'B', 'A', 'B'] * 5
np.random.seed(121)
resps = [(i*t) np.random.normal() for i, t in zip(ids, times)]
DATA = {
'identity': ids,
'time': times,
'state': states,
'resps': resps
}
df = pd.DataFrame(DATA)
df.head()
Gives us:
identity time state resps
0 1 1 A 0.787967
1 1 1 B 0.715071
2 1 2 A 1.426102
3 1 2 B 1.559690
4 2 1 A 1.669889
Then making the differences in a complicated way:
wide = df.set_index(['identity', 'state', 'time']).unstack().reset_index()
A = wide['state']=='A'
B = wide['state']=='B'
wide['diffA'] = wide[A][('resps', 2)] - wide[A][('resps', 1)]
wide['diffB'] = wide[B][('resps', 2)] - wide[B][('resps', 1)]
wide['difference'] = wide[['diffA', 'diffB']].sum(axis=1)
wide = wide.drop(columns=[('diffA', ''), ('diffB', '')])
print(wide)
identity state resps difference
time 1 2
0 1 A 0.787967 1.426102 0.638135
1 1 B 0.715071 1.559690 0.844619
2 2 A 1.669889 5.615373 3.945483
3 2 B 3.183695 4.367062 1.183368
4 3 A 2.985881 7.709641 4.723760
5 3 B 3.629642 4.673013 1.043371
6 4 A 4.401873 9.403826 5.001953
7 4 B 3.808573 6.031231 2.222658
8 5 A 4.209585 10.087744 5.878159
9 5 B 4.267278 9.499714 5.232436
CodePudding user response:
If you are mostly interested in just the difference
column, below might help:
In [435]: df['difference'] = df.groupby(['identity', 'state'])['resps'].diff()
In [438]: df.dropna()[['identity', 'state', 'difference']]
Out[438]:
identity state difference
2 1 A 0.638135
3 1 B 0.844619
6 2 A 3.945483
7 2 B 1.183368
10 3 A 4.723760
11 3 B 1.043371
14 4 A 5.001953
15 4 B 2.222658
18 5 A 5.878159
19 5 B 5.232436