Sps Gps start end
SP1 G1 2 322
SP1 G1 318 1368
SP1 G1 21125 22297
SP2 G2 2 313
SP2 G2 334 1359
SP2 G2 11716 11964
SP2 G2 20709 20885
SP2 G2 21080 22297
SP3 G3 2 313
SP3 G3 328 1368
SP3 G3 21116 22294
SP4 G4 346 1356
SP4 G4 21131 22282
and I would like to add a new columns Threshold_gps
for each Sps
and Gps
that have start and end next to each others but where the distance length (end-start) is below a threshold of 500.
Let's take examples:
SP1-G1
Sps Gps start end
SP1 G1 2 322
SP1 G1 318 1368
SP1 G1 21125 22297
here 318-322=-4
which is < 500
so I group them
Sps Gps start end Threshold_gps
SP1 G1 2 322 G1
SP1 G1 318 1368 G1
SP1 G1 21125 22297
then, 21125-1368=19757
which is > 500
so I do not group them
Sps Gps start end Threshold_gps
SP1 G1 2 322 G1
SP1 G1 318 1368 G1
SP1 G1 21125 22297 G2
SP2-G2
Sps Gps start end Threshold_gps
SP2 G2 2 313
SP2 G2 334 1359
SP2 G2 11716 11964
SP2 G2 20709 20885
SP2 G2 21080 22297
334-313=21
which is < 500
so I group them
Sps Gps start end Threshold_gps
SP2 G2 2 313 G1
SP2 G2 334 1359 G1
SP2 G2 11716 11964
SP2 G2 20709 20885
SP2 G2 21080 22297
then, 11716-1359=10357
which is > 500
so I do not group them
Sps Gps start end Threshold_gps
SP2 G2 2 313 G1
SP2 G2 334 1359 G1
SP2 G2 11716 11964 G2
SP2 G2 20709 20885
SP2 G2 21080 22297
then, 20709-11964=8745
which is > 500
so I do not group them
Sps Gps start end Threshold_gps
SP2 G2 2 313 G1
SP2 G2 334 1359 G1
SP2 G2 11716 11964 G2
SP2 G2 20709 20885 G3
SP2 G2 21080 22297
then, 21080-20885=195
which is < 500
so I group them
Sps Gps start end Threshold_gps
SP2 G2 2 313 G1
SP2 G2 334 1359 G1
SP2 G2 11716 11964 G2
SP2 G2 20709 20885 G3
SP2 G2 21080 22297 G3
and so on..
Sps Gps start end Threshold_gps
SP1 G1 2 322 G1
SP1 G1 318 1368 G1
SP1 G1 21125 22297 G2
SP2 G2 2 313 G1
SP2 G2 334 1359 G1
SP2 G2 11716 11964 G2
SP2 G2 20709 20885 G3
SP2 G2 21080 22297 G3
SP3 G3 2 313 G1
SP3 G3 328 1368 G1
SP3 G3 21116 22294 G2
SP4 G4 346 1356 G1
SP4 G4 21131 22282 G2
Does someone have an idea please?
Here is the dict format of the tab if it can helps:
{'Sps': {0: 'SP1', 1: 'SP1', 2: 'SP1', 3: 'SP2', 4: 'SP2', 5: 'SP2', 6: 'SP2', 7: 'SP2', 8: 'SP3', 9: 'SP3', 10: 'SP3', 11: 'SP4', 12: 'SP4'}, 'Gps': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G3', 9: 'G3', 10: 'G3', 11: 'G4', 12: 'G4'}, 'start': {0: 2, 1: 318, 2: 21125, 3: 2, 4: 334, 5: 11716, 6: 20709, 7: 21080, 8: 2, 9: 328, 10: 21116, 11: 346, 12: 21131}, 'end': {0: 322, 1: 1368, 2: 22297, 3: 313, 4: 1359, 5: 11964, 6: 20885, 7: 22297, 8: 313, 9: 1368, 10: 22294, 11: 1356, 12: 22282}}
CodePudding user response:
I believe you might want:
df['Threshold_gps'] = (df
.groupby(['Sps', 'Gps'], group_keys=False)
.apply(lambda d: (s:=d['end'].shift().rsub(d['start'])
.gt(500))
.cumsum().add(1-s.iloc[0])
.astype(str).radd('G')
)
)
for python <3.8:
def get_group(g):
s = g['end'].shift().rsub(g['start']).gt(500)
return s.cumsum().add(1-s.iloc[0]).astype(str).radd('G')
df['Threshold_gps'] = (df
.groupby(['Sps', 'Gps'], group_keys=False)
.apply(get_group)
)
Output:
Sps Gps start end Threshold_gps
0 SP1 G1 2 322 G1
1 SP1 G1 318 1368 G1
2 SP1 G1 21125 22297 G2
3 SP2 G2 2 313 G1
4 SP2 G2 334 1359 G1
5 SP2 G2 11716 11964 G2
6 SP2 G2 20709 20885 G3
7 SP2 G2 21080 22297 G3
8 SP3 G3 2 313 G1
9 SP3 G3 328 1368 G1
10 SP3 G3 21116 22294 G2
11 SP4 G4 346 1356 G1
12 SP4 G4 21131 22282 G2