I'm trying to calculate, within each user group, the time difference from the last occurrence of Flag == 1:
import pandas as pd

df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y','y'],
                   'Flag': [0,0,1,0,1,0,1,0,1,0],
                   'time': [10, 34, 40, 43, 44, 12, 20, 46, 51, 71]})
This is how I am calculating the diff from the last 1:
(df.assign(mask=df['Flag'].eq(1),
           group=lambda d: d.groupby('user')['mask'].cumsum(),
           # diff from last 1
           diff=lambda d: d.groupby(['user', 'group'])['time'].apply(lambda g: g - (g.iloc[0] if g.name[1] > 0 else float('nan'))),
          )
   # mask 1s with their own diff
   .assign(diff=lambda d: d['diff'].mask(d['mask'],
                                         (d[d['mask'].groupby(d['user']).cumsum().eq(0) | d['mask']].groupby('user')['time'].diff())
                                        )
          )
)
For some reason I don't get the NaN values I expect at indexes 2 and 6:
  user  Flag  time   mask  group  diff
0    x     0    10  False      0   NaN
1    x     0    34  False      0   NaN
2    x     1    40   True      1   6.0
3    x     0    43  False      1   3.0
4    x     1    44   True      2   4.0
5    y     0    12  False      0   NaN
6    y     1    20   True      1   8.0
7    y     0    46  False      1  26.0
8    y     1    51   True      2  31.0
9    y     0    71  False      2  20.0
I also tried df.loc[[2,6],'diff'] = 1000000000, but I am getting:
  user  Flag  time         diff
0    x     0    10          NaN
1    x     0    34          NaN
2    x     1    40  100000000.0
3    x     0    43          NaN
4    x     1    44          NaN
5    y     0    12          NaN
6    y     1    20  100000000.0
7    y     0    46          NaN
8    y     1    51          NaN
9    y     0    71          NaN
CodePudding user response:
IIUC, the idea is to replace time with NaN wherever Flag is not 1, then group by user and combine shift and ffill to carry the time of the last 1 down to the following rows. Subtracting that from the original time column gives the difference:
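df['diff'] = (
    df['time']
    - df['time'].where(df['Flag'].astype(bool))
                .groupby(df['user'])
                # transform keeps the original index, so the subtraction aligns row by row
                # (apply can add the group key to the index in recent pandas versions)
                .transform(lambda x: x.shift().ffill())
)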
print(df)
  user  Flag  time  diff
0    x     0    10   NaN
1    x     0    34   NaN
2    x     1    40   NaN
3    x     0    43   3.0
4    x     1    44   4.0
5    y     0    12   NaN
6    y     1    20   NaN
7    y     0    46  26.0
8    y     1    51  31.0
9    y     0    71  20.0
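To make the chain easier to follow, here is a minimal sketch that splits the same idea into named steps. The intermediate names (flag_time, shifted, last_flag_time) are only illustrative and are not part of the original answer. It uses the groupby shift and ffill methods directly instead of a lambda.

```python
import pandas as pd

df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y','y'],
                   'Flag': [0,0,1,0,1,0,1,0,1,0],
                   'time': [10, 34, 40, 43, 44, 12, 20, 46, 51, 71]})

# Step 1: keep the time only on rows where Flag == 1, NaN elsewhere
flag_time = df['time'].where(df['Flag'].eq(1))

# Step 2: shift by one row within each user, so a flagged row
# never sees its own time, only the time of an earlier flag
shifted = flag_time.groupby(df['user']).shift()

# Step 3: forward-fill within each user to carry the last flag time down
last_flag_time = shifted.groupby(df['user']).ffill()

# Step 4: difference between each row's time and the last earlier flag time
df['diff'] = df['time'] - last_flag_time
print(df)
```

Rows 2 and 6 come out as NaN because the shift in step 2 leaves nothing earlier to fill from. In the question's code, the 6.0 and 8.0 appear because the second assign also keeps the rows before the first flag (the cumsum().eq(0) part), so the first flagged row is diffed against the preceding unflagged row (40 - 34 and 20 - 12).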