I have a dataframe with start and end positions. I want to squash rows where end_n is close to start_n+1 and add up the corresponding values. In the end I want the cumulative sum and the start and end values from which that sum came.
Example input data, allowing a distance of <5 from end_n to start_n+1:
   start  end  value
1      0   10      3
2     11   15      4
3     17   20      5
4     45   50      3
5     51   60     13
6    100  120      9
Desired result:
   start  end  value
1      0   10      3
2     11   15      4
3     17   20      5
4     45   50      3
5     51   60     13
6    100  120      9
or
   start  end  sum
1      0   20   12
4     45   60   16
6    100  120    9
I suppose a lambda function would do it, but the original data is large and that would impact performance. I would prefer a pure pandas/numpy solution.
Answer:
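Subtract the shifted end values from start and compare each gap with the threshold of 5, take the cumulative sum of the resulting flags to get group labels, and then aggregate with GroupBy.agg. Since gaps below 5 should be merged, a new group starts where the gap is 5 or more (ge rather than gt):
# Flag rows whose start is 5 or more past the previous end; cumsum turns the flags into group labels
g = df['start'].sub(df['end'].shift(fill_value=0)).ge(5).cumsum()
# Per group: first start, last end, and the summed value
df = df.groupby(g).agg(start=('start', 'first'), end=('end', 'last'), sum=('value', 'sum'))
print(df)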
   start  end  sum
0      0   20   12
1     45   60   16
2    100  120    9
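For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same shift / compare / cumsum / agg idea on the sample data from the question. The function name squash_close_rows and the max_gap parameter are made up for illustration, and the sketch assumes the frame is already sorted by start and that the intervals do not overlap.

```python
import pandas as pd


def squash_close_rows(df: pd.DataFrame, max_gap: int = 5) -> pd.DataFrame:
    """Merge consecutive rows whose gap to the previous row's end is below max_gap.

    Hypothetical helper; assumes df is sorted by 'start' with non-overlapping intervals.
    """
    # Gap between each start and the previous row's end. The first row has no
    # predecessor, so its gap is NaN, and NaN >= max_gap evaluates to False.
    gap = df["start"] - df["end"].shift()
    # Each True marks the first row of a new group; cumsum turns the marks into ids.
    group_id = (gap >= max_gap).cumsum()
    return df.groupby(group_id).agg(
        start=("start", "first"),
        end=("end", "last"),
        sum=("value", "sum"),
    )


# Sample data from the question, with the same 1-based index
df = pd.DataFrame(
    {
        "start": [0, 11, 17, 45, 51, 100],
        "end": [10, 15, 20, 50, 60, 120],
        "value": [3, 4, 5, 3, 13, 9],
    },
    index=range(1, 7),
)

result = squash_close_rows(df)
print(result)
```

Here shift() is called without fill_value, so the first row always lands in group 0 regardless of where its start lies. Everything stays vectorized, with no Python-level lambda or row loop, which is probably what matters for a large frame.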