Group columns if coordinates are not more distant than a threshold-CodePudding

Sps       Gps start  end
SP1       G1      2    322
SP1       G1    318   1368
SP1       G1  21125  22297
SP2       G2      2    313
SP2       G2    334   1359
SP2       G2  11716  11964
SP2       G2  20709  20885
SP2       G2  21080  22297
SP3       G3      2    313
SP3       G3    328   1368
SP3       G3  21116  22294 
SP4       G4    346   1356
SP4       G4  21131  22282

and I would like to add a new columns Threshold_gps for each Sps and Gps that have start and end next to each others but where the distance length (end-start) is below a threshold of 500.

Let's take examples:

SP1-G1

Sps       Gps start  end
SP1       G1      2    322
SP1       G1    318   1368
SP1       G1  21125  22297

here 318-322=-4 which is < 500 so I group them

Sps       Gps start  end    Threshold_gps
SP1       G1      2    322  G1
SP1       G1    318   1368  G1
SP1       G1  21125  22297

then, 21125-1368=19757 which is > 500 so I do not group them

Sps       Gps start  end    Threshold_gps
SP1       G1      2    322  G1
SP1       G1    318   1368  G1
SP1       G1  21125  22297  G2

SP2-G2

Sps       Gps start  end    Threshold_gps
SP2       G2      2    313
SP2       G2    334   1359
SP2       G2  11716  11964
SP2       G2  20709  20885
SP2       G2  21080  22297

334-313=21 which is < 500 so I group them

Sps       Gps start  end    Threshold_gps
SP2       G2      2    313  G1
SP2       G2    334   1359  G1
SP2       G2  11716  11964
SP2       G2  20709  20885
SP2       G2  21080  22297

then, 11716-1359=10357 which is > 500 so I do not group them

Sps       Gps start  end    Threshold_gps
SP2       G2      2    313  G1
SP2       G2    334   1359  G1
SP2       G2  11716  11964  G2
SP2       G2  20709  20885
SP2       G2  21080  22297

then, 20709-11964=8745 which is > 500 so I do not group them

Sps       Gps start  end    Threshold_gps
SP2       G2      2    313  G1
SP2       G2    334   1359  G1
SP2       G2  11716  11964  G2
SP2       G2  20709  20885  G3 
SP2       G2  21080  22297

then, 21080-20885=195 which is < 500 so I group them

Sps       Gps start  end    Threshold_gps
SP2       G2      2    313  G1
SP2       G2    334   1359  G1
SP2       G2  11716  11964  G2
SP2       G2  20709  20885  G3 
SP2       G2  21080  22297  G3

and so on..

Sps       Gps start  end    Threshold_gps
SP1       G1      2    322  G1
SP1       G1    318   1368  G1
SP1       G1  21125  22297  G2 
SP2       G2      2    313  G1
SP2       G2    334   1359  G1
SP2       G2  11716  11964  G2
SP2       G2  20709  20885  G3 
SP2       G2  21080  22297  G3
SP3       G3      2    313  G1
SP3       G3    328   1368  G1
SP3       G3  21116  22294  G2
SP4       G4    346   1356  G1
SP4       G4  21131  22282  G2

Does someone have an idea please?

Here is the dict format of the tab if it can helps:

{'Sps': {0: 'SP1', 1: 'SP1', 2: 'SP1', 3: 'SP2', 4: 'SP2', 5: 'SP2', 6: 'SP2', 7: 'SP2', 8: 'SP3', 9: 'SP3', 10: 'SP3', 11: 'SP4', 12: 'SP4'}, 'Gps': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G3', 9: 'G3', 10: 'G3', 11: 'G4', 12: 'G4'}, 'start': {0: 2, 1: 318, 2: 21125, 3: 2, 4: 334, 5: 11716, 6: 20709, 7: 21080, 8: 2, 9: 328, 10: 21116, 11: 346, 12: 21131}, 'end': {0: 322, 1: 1368, 2: 22297, 3: 313, 4: 1359, 5: 11964, 6: 20885, 7: 22297, 8: 313, 9: 1368, 10: 22294, 11: 1356, 12: 22282}}

CodePudding user response：

I believe you might want:

df['Threshold_gps'] = (df
   .groupby(['Sps', 'Gps'], group_keys=False)
   .apply(lambda d: (s:=d['end'].shift().rsub(d['start'])
                     .gt(500))
                     .cumsum().add(1-s.iloc[0])
                     .astype(str).radd('G')
         )
)

for python <3.8:

def get_group(g):
    s = g['end'].shift().rsub(g['start']).gt(500)
    return s.cumsum().add(1-s.iloc[0]).astype(str).radd('G')

df['Threshold_gps'] = (df
   .groupby(['Sps', 'Gps'], group_keys=False)
   .apply(get_group)
)

Output:

    Sps Gps  start    end Threshold_gps
0   SP1  G1      2    322            G1
1   SP1  G1    318   1368            G1
2   SP1  G1  21125  22297            G2
3   SP2  G2      2    313            G1
4   SP2  G2    334   1359            G1
5   SP2  G2  11716  11964            G2
6   SP2  G2  20709  20885            G3
7   SP2  G2  21080  22297            G3
8   SP3  G3      2    313            G1
9   SP3  G3    328   1368            G1
10  SP3  G3  21116  22294            G2
11  SP4  G4    346   1356            G1
12  SP4  G4  21131  22282            G2