How to calculate cumulative subtraction with a threshold and reset the subtraction after threshold w-CodePudding

This is a dataframe, with 4 columns. The primary dataframe contains two columns, trip and timestamps, and I calculated 'TimeDistance' which is the difference between rows of timestamps, and 'cum' which is the cumulative sum over TimeDistance column. in order to reach my goal, but I could not.

import pandas as pd
import numpy as np

data={'trip':[1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,4,4,5,5,5],
      'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471778, 1235471780, 1235471782, 1235471784, 1235471786, 1235471788,1235471820,1235471826,1235471829,1235471890,1235471893,1235471894,1235471896,1235471900,1235471910,1235471912]}

df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df['cum']=df.groupby('trip')['TimeDistance'].cumsum()
df

this is the output:

    trip    timestamps  TimeDistance    cum
0   1   1235471761  NaN NaN
1   1   1235471763  2.0 2.0
2   1   1235471765  2.0 4.0
3   1   1235471767  2.0 6.0
4   1   1235471778  11.0    17.0
5   1   1235471780  2.0 19.0
6   1   1235471782  2.0 21.0
7   2   1235471784  NaN NaN
8   2   1235471786  2.0 2.0
9   2   1235471788  2.0 4.0
10  2   1235471820  32.0    36.0
11  2   1235471826  6.0 42.0
12  2   1235471829  3.0 45.0
13  3   1235471890  NaN NaN
14  3   1235471893  3.0 3.0
15  4   1235471894  NaN NaN
16  4   1235471896  2.0 2.0
17  5   1235471900  NaN NaN
18  5   1235471910  10.0    10.0
19  5   1235471912  2.0 12.0

This output is not my desired output, I want to subtract each row of the timestamp column from the first row for each trip, store it in a new column (cum), and whenever it reaches 10, do these for the next rows:

reset the subtraction,
the next row after the row in which the threshold is reached will be considered as the origin and it must be equal to zero,
continue subtraction from this row (which is equal to zero) and subsequent rows again until we reach 10.
Whenever we reach the end of a trip, the subtraction will also reset for a new trip.
Repeat this procedure for all trips.

for example, in row 4, we have reached to threshold because the value in 'cum' column is 17, so, the next row in the 'cum' column must be 0 (but it is 19) and for row 6, we have to calculate the difference between timestamps in row 5, 6 that should be 2, not 19!

for more clarity, I have attached a screenshot from my desired output

CodePudding user response：

You can use a mask to reset the cumsum:

df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)

# get rows above threshold
m = df['TimeDistance'].gt(10).groupby(df['trip']).shift(fill_value=False)

df['cum'] = (df['TimeDistance']
             .mask(m, 0)
             .groupby([df['trip'], m.cumsum()])
             .cumsum()
            )

output:

    trip  timestamps  TimeDistance   cum
0      1  1235471761           NaN   NaN
1      1  1235471763           2.0   2.0
2      1  1235471765           2.0   4.0
3      1  1235471767           2.0   6.0
4      1  1235471778          11.0  17.0
5      1  1235471780           2.0   0.0
6      1  1235471782           2.0   2.0
7      2  1235471784           NaN   NaN
8      2  1235471786           2.0   2.0
9      2  1235471788           2.0   4.0
10     2  1235471820          32.0  36.0
11     2  1235471826           6.0   0.0
12     2  1235471829           3.0   3.0
13     3  1235471890           NaN   NaN
14     3  1235471893           3.0   3.0
15     4  1235471894           NaN   NaN
16     4  1235471896           2.0   2.0
17     5  1235471900           NaN   NaN
18     5  1235471910          10.0  10.0
19     5  1235471912           2.0  12.0

CodePudding user response：

IIUC, you can just write your own function that iterates over your differences and updates the cumulative sum of the differences if it is greater than your threshold.

Given your data:

import pandas as pd
import numpy as np

data={'trip':[1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,4,4,5,5,5],
      'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471778, 1235471780, 1235471782, 1235471784, 1235471786, 1235471788,1235471820,1235471826,1235471829,1235471890,1235471893,1235471894,1235471896,1235471900,1235471910,1235471912]}

df = pd.DataFrame(data)

You can do the following:

ser = df.groupby('trip').diff().fillna(0)['timestamps']

def cumsum_with_threshold(ser: pd.Series, threshold: float = 10.) -> list:
    cumsum = 0
    cumsum_ls = []
    for i in range(len(ser)):
        cumsum  = ser[i]
        cumsum_ls.append(cumsum)
        if cumsum.__ge__(threshold): 
            cumsum = 0 
            ser[i 1] = 0
    return cumsum_ls

df['cum'] = cumsum_with_threshold(ser)

Which results into this:

   trip timestamps  cum
0   1   1235471761  0.0
1   1   1235471763  2.0
2   1   1235471765  4.0
3   1   1235471767  6.0
4   1   1235471778  17.0
5   1   1235471780  0.0
6   1   1235471782  2.0
7   2   1235471784  2.0
8   2   1235471786  4.0
9   2   1235471788  6.0
10  2   1235471820  38.0
11  2   1235471826  0.0
12  2   1235471829  3.0
13  3   1235471890  3.0
14  3   1235471893  6.0
15  4   1235471894  6.0
16  4   1235471896  8.0
17  5   1235471900  8.0
18  5   1235471910  18.0
19  5   1235471912  0.0
```