This is a toy dataset:
df = pd.DataFrame({'ID': ['A','A','A','A'],
'target': ['B','B','B','B'],
'length':[208,315,1987,3775],
'start':[139403,140668,141726,143705],
'end':[139609,140982,143711,147467]})
ID target length start end
0 A B 208 139403 139609
1 A B 315 140668 140982
2 A B 1987 141726 143711
3 A B 3775 143705 147467
I need to sum the length column taking overlapping ranges in start and end into consideration.
In row 3, the start value 143705 is between the ranges of row 2's 141726-143711.
143711 - 143705 = 6 so I need to subtract 6 from the length in row 3: 3775 - 6 = 3769.
Which would change the df to:
ID target length start end
0 A B 208 139403 139609
1 A B 315 140668 140982
2 A B 1987 141726 143711
3 A B 3769 143705 147467
Then grouping by ID, target and summing length:
df.groupby(['ID','target'])['length'].sum()
ID target
A B 6279
Name: length, dtype: int64
Does anyone know how I can do this in a pythonic way? Thanks so much for any help!
CodePudding user response:
A possible solution:
(df.assign(length=
df['start'].lt(df['end'].shift())
.mul(df['start']-df['end'].shift(fill_value=0))
.add(df['length'])))
Output:
ID target length start end
0 A B 208 139403 139609
1 A B 315 140668 140982
2 A B 1987 141726 143711
3 A B 3769 143705 147467