DataFrame - time since last positive and last negative value

I have an input dataframe that looks like this:

df = pd.DataFrame({"t": [1, 2, 3, 4, 5], "val": [100, 5, -4, -9, 1]})

I need to calculate the following 2 columns, one for the time since the last positive value, and one for the time since the last negative value:

df['t_since_neg'] = [np.nan, np.nan, np.nan, 1, 1]
df['t_since_pos'] = [np.nan, 1, 1, 2, 3]

The column t stands for time. How do I do this? I know it would have something to do with diff, but I couldn't get it to work exactly.

Update (follow-up question): how would I do this if I have an additional column 'id' and the calculations need to be done within each group separately, i.e. each group is independent of the others?

CodePudding user response:

Solution

m = df.val > 0

df['t_since_neg'] = df['t'] - df['t'].where(~m).ffill().shift()
df['t_since_pos'] = df['t'] - df['t'].where( m).ffill().shift()

   t  val  t_since_neg  t_since_pos
0  1  100          NaN          NaN
1  2    5          NaN          1.0
2  3   -4          NaN          1.0
3  4   -9          1.0          2.0
4  5    1          1.0          3.0
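
One detail worth noting about the mask: ~m is val <= 0, so a row with val == 0 would be treated as "negative". If zeros should count as neither positive nor negative, a possible variation is to build two explicit masks. This is only a sketch; the data below is hypothetical and differs from the question's input by containing zeros.

```python
import pandas as pd

# Hypothetical data containing zeros, which are neither positive nor negative.
df = pd.DataFrame({"t": [1, 2, 3, 4, 5], "val": [100, 0, -4, 0, 1]})

pos = df["val"] > 0  # strictly positive
neg = df["val"] < 0  # strictly negative, so zeros fall into neither mask

# Same where / ffill / shift pattern as above, once per mask.
df["t_since_neg"] = df["t"] - df["t"].where(neg).ffill().shift()
df["t_since_pos"] = df["t"] - df["t"].where(pos).ffill().shift()

print(df)
```

With these masks the zero rows never reset either counter, so t_since_neg comes out as NaN, NaN, NaN, 1, 2 and t_since_pos as NaN, 1, 2, 3, 4.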

Explained

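To calculate t_since_pos, first mask the time column wherever the corresponding val is not positive, so that only the times of positive values remain. Forward filling then carries the most recent positive time down to the later rows, and shifting by one row makes each row see the last positive time strictly before it. Subtracting that from the original time column gives the elapsed time. The same approach with the inverted mask gives t_since_neg. The intermediate steps for t_since_pos look like this: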

>>> df['t'].where(m)
0    1.0
1    2.0
2    NaN
3    NaN
4    5.0
Name: t, dtype: float64

>>> df['t'].where(m).ffill().shift()
0    NaN
1    1.0
2    2.0
3    2.0
4    2.0
Name: t, dtype: float64
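
For the follow-up question (an extra id column, with each group independent), one possible sketch is to run the forward fill and shift per group, so that no time leaks from one id into the next. The id values, the second group's data, and the helper name time_since are made up for illustration, and the rows are assumed to be sorted by t within each id.

```python
import pandas as pd

# Hypothetical sample: group "a" is the question's data, group "b" is invented.
df = pd.DataFrame({
    "id":  ["a", "a", "a", "a", "a", "b", "b", "b"],
    "t":   [1, 2, 3, 4, 5, 1, 2, 3],
    "val": [100, 5, -4, -9, 1, -2, 7, 3],
})

m = df["val"] > 0


def time_since(mask):
    """Time elapsed since the last earlier row (same id) where mask was True."""
    # Keep t only where the mask holds, then forward fill and shift
    # inside each id group so groups stay independent of each other.
    last_t = (
        df["t"].where(mask)
        .groupby(df["id"])
        .transform(lambda s: s.ffill().shift())
    )
    return df["t"] - last_t


df["t_since_neg"] = time_since(~m)
df["t_since_pos"] = time_since(m)

print(df)
```

Group "a" reproduces the columns from the question, and the first row of group "b" is NaN in both columns because nothing is carried over from group "a".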