I have an input dataframe that looks like this:
df = pd.DataFrame.from_dict({"t": [1,2,3,4,5], 'val': [100, 5, -4, -9, 1], })
I need to calculate the following 2 columns, one for the time since the last positive value, and one for the time since the last negative value:
df['t_since_neg'] = [np.nan, np.nan, np.nan, 1, 1]
df['t_since_pos'] = [np.nan, 1, 1,2,3]
The column t stands for time. How do I do this? I know it would have something to do with diff, but I couldn't get it to work exactly.
Update (follow-up question): how would I do this if I have an additional column 'id' and the calculations need to be done within each group separately, i.e. each group is independent of the others?
CodePudding user response:
Solution
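# True where val is positive; ~m covers every other row (zeros included)
m = df.val > 0
# current time minus the time of the last matching row strictly before it
df['t_since_neg'] = df['t'] - df['t'].where(~m).ffill().shift()
df['t_since_pos'] = df['t'] - df['t'].where( m).ffill().shift()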
   t  val  t_since_neg  t_since_pos
0  1  100          NaN          NaN
1  2    5          NaN          1.0
2  3   -4          NaN          1.0
3  4   -9          1.0          2.0
4  5    1          1.0          3.0
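For a self-contained check, the snippet above can be wrapped in a small helper. This is only a sketch, not part of the original answer: the function name add_time_since and its parameters are hypothetical, it assumes the rows are already sorted by t, and it assumes a zero should count as neither positive nor negative (the ~m mask above treats zero as negative).

```python
import numpy as np
import pandas as pd


def add_time_since(df, t_col="t", val_col="val"):
    """Return a copy of df with t_since_neg and t_since_pos added.

    Hypothetical helper; assumes rows are already sorted by time.
    """
    out = df.copy()
    pos = out[val_col] > 0  # strictly positive rows
    neg = out[val_col] < 0  # strictly negative rows (zeros match neither mask)
    for name, mask in [("t_since_neg", neg), ("t_since_pos", pos)]:
        # Keep t only where the mask holds, carry it forward, then shift by
        # one row so each row sees the last matching time strictly before it.
        last_t = out[t_col].where(mask).ffill().shift()
        out[name] = out[t_col] - last_t
    return out


df = pd.DataFrame({"t": [1, 2, 3, 4, 5], "val": [100, 5, -4, -9, 1]})
result = add_time_since(df)
print(result)
```

On the sample data this should reproduce the two columns requested in the question.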
Explained
To calculate t_since_pos, first mask the values in the time column where the corresponding val is not positive, then forward fill and shift to propagate the time of the last positive value, and finally subtract this from the original time column. The same approach can be used to calculate t_since_neg.
>>> df['t'].where(m)
0 1.0
1 2.0
2 NaN
3 NaN
4 5.0
Name: t, dtype: float64
>>> df['t'].where(m).ffill().shift()
0 NaN
1 1.0
2 2.0
3 2.0
4 2.0
Name: t, dtype: float64
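For the follow-up question about an 'id' column, one possible approach is to run the same ffill and shift inside each group, so the last-seen time never carries over from one id to the next. The sketch below is an assumption about how that could look, not something from the original answer: the sample data and the helper name time_since are made up for illustration, and rows are assumed to be sorted by t within each id.

```python
import numpy as np
import pandas as pd

# Made-up sample data with two independent groups
df = pd.DataFrame({
    "id":  ["a", "a", "a", "b", "b", "b"],
    "t":   [1, 2, 3, 1, 2, 3],
    "val": [5, -1, 2, -3, 4, 6],
})

m = df["val"] > 0  # same mask as above; ~m still includes zeros


def time_since(mask):
    # Keep t only on rows where the mask holds
    masked = df["t"].where(mask)
    # Forward fill and shift within each id, so nothing leaks across groups
    last_t = masked.groupby(df["id"]).transform(lambda s: s.ffill().shift())
    return df["t"] - last_t


df["t_since_neg"] = time_since(~m)
df["t_since_pos"] = time_since(m)
print(df)
```

The first row of every group comes out as NaN in both columns, because the shift happens per group and there is no earlier row within that id to look back to.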