This is a dataframe, with 4 columns. The primary dataframe contains two columns, trip and timestamps, and I calculated 'TimeDistance' which is the difference between rows of timestamps, and 'cum' which is the cumulative sum over TimeDistance column. in order to reach my goal, but I could not.
import pandas as pd
import numpy as np
data={'trip':[1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,4,4,5,5,5],
'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471778, 1235471780, 1235471782, 1235471784, 1235471786, 1235471788,1235471820,1235471826,1235471829,1235471890,1235471893,1235471894,1235471896,1235471900,1235471910,1235471912]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df['cum']=df.groupby('trip')['TimeDistance'].cumsum()
df
this is the output:
trip timestamps TimeDistance cum
0 1 1235471761 NaN NaN
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471778 11.0 17.0
5 1 1235471780 2.0 19.0
6 1 1235471782 2.0 21.0
7 2 1235471784 NaN NaN
8 2 1235471786 2.0 2.0
9 2 1235471788 2.0 4.0
10 2 1235471820 32.0 36.0
11 2 1235471826 6.0 42.0
12 2 1235471829 3.0 45.0
13 3 1235471890 NaN NaN
14 3 1235471893 3.0 3.0
15 4 1235471894 NaN NaN
16 4 1235471896 2.0 2.0
17 5 1235471900 NaN NaN
18 5 1235471910 10.0 10.0
19 5 1235471912 2.0 12.0
This output is not my desired output, I want to subtract each row of the timestamp column from the first row for each trip, store it in a new column (cum), and whenever it reaches 10, do these for the next rows:
- reset the subtraction,
- the next row after the row in which the threshold is reached will be considered as the origin and it must be equal to zero,
- continue subtraction from this row (which is equal to zero) and subsequent rows again until we reach 10.
- Whenever we reach the end of a trip, the subtraction will also reset for a new trip.
- Repeat this procedure for all trips.
for example, in row 4, we have reached to threshold because the value in 'cum' column is 17, so, the next row in the 'cum' column must be 0 (but it is 19) and for row 6, we have to calculate the difference between timestamps in row 5, 6 that should be 2, not 19!
CodePudding user response:
You can use a mask to reset the cumsum:
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
# get rows above threshold
m = df['TimeDistance'].gt(10).groupby(df['trip']).shift(fill_value=False)
df['cum'] = (df['TimeDistance']
.mask(m, 0)
.groupby([df['trip'], m.cumsum()])
.cumsum()
)
output:
trip timestamps TimeDistance cum
0 1 1235471761 NaN NaN
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471778 11.0 17.0
5 1 1235471780 2.0 0.0
6 1 1235471782 2.0 2.0
7 2 1235471784 NaN NaN
8 2 1235471786 2.0 2.0
9 2 1235471788 2.0 4.0
10 2 1235471820 32.0 36.0
11 2 1235471826 6.0 0.0
12 2 1235471829 3.0 3.0
13 3 1235471890 NaN NaN
14 3 1235471893 3.0 3.0
15 4 1235471894 NaN NaN
16 4 1235471896 2.0 2.0
17 5 1235471900 NaN NaN
18 5 1235471910 10.0 10.0
19 5 1235471912 2.0 12.0
CodePudding user response:
IIUC, you can just write your own function that iterates over your differences and updates the cumulative sum of the differences if it is greater than your threshold.
Given your data:
import pandas as pd
import numpy as np
data={'trip':[1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,4,4,5,5,5],
'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471778, 1235471780, 1235471782, 1235471784, 1235471786, 1235471788,1235471820,1235471826,1235471829,1235471890,1235471893,1235471894,1235471896,1235471900,1235471910,1235471912]}
df = pd.DataFrame(data)
You can do the following:
ser = df.groupby('trip').diff().fillna(0)['timestamps']
def cumsum_with_threshold(ser: pd.Series, threshold: float = 10.) -> list:
cumsum = 0
cumsum_ls = []
for i in range(len(ser)):
cumsum = ser[i]
cumsum_ls.append(cumsum)
if cumsum.__ge__(threshold):
cumsum = 0
ser[i 1] = 0
return cumsum_ls
df['cum'] = cumsum_with_threshold(ser)
Which results into this:
trip timestamps cum
0 1 1235471761 0.0
1 1 1235471763 2.0
2 1 1235471765 4.0
3 1 1235471767 6.0
4 1 1235471778 17.0
5 1 1235471780 0.0
6 1 1235471782 2.0
7 2 1235471784 2.0
8 2 1235471786 4.0
9 2 1235471788 6.0
10 2 1235471820 38.0
11 2 1235471826 0.0
12 2 1235471829 3.0
13 3 1235471890 3.0
14 3 1235471893 6.0
15 4 1235471894 6.0
16 4 1235471896 8.0
17 5 1235471900 8.0
18 5 1235471910 18.0
19 5 1235471912 0.0
```