Add overlapping coordinates within groups in pandas-CodePudding

I have a dataframe such as :

Gps1    Gps2   start end
G1      GA     106   205
G1      GA     102   203
G1      GA     106   203
G1      GA     106   203
G2      GB       9    51
G2      GB      48   135
G2      GB     131   207
G2      GB     207   279
G3      GC     330   419
G3      GC     266   315
G3      GC     257   315
G3      GC     266   407
G4      GC     10    30
G4      GC     60    90

and I would like for each ['Gps1','Gps2'] groups to calculate for each row the number of overlapping coordinates between the first row end - the end of the second row.

So here is a detailed example :

for the first row of G1-GA:

205-102 = 103

so I put 0.98 in the first row:

Gps1    Gps2   start end  Nb_overlapping
G1      GA     106   205  103
G1      GA     102   203  
G1      GA     106   203
G1      GA     106   203

then the second row:

203-106= 97, so I fill it and so on for the others :

Gps1    Gps2   start end  Nb_overlapping
G1      GA     106   205  103
G1      GA     102   203  97
G1      GA     106   203  113
G1      GA     90    210

The last row have to have the same value as the before-last value :

Gps1    Gps2   start end  Nb_overlapping
G1      GA     106   205  103
G1      GA     102   203  97
G1      GA     106   203  113
G1      GA     106   203  113

Then for the group G2-GB I do the same :

Gps1    Gps2   start end  Nb_overlapping
G2      GB       9    51  3
G2      GB      48   135  4
G2      GB     131   207  0
G2      GB     207   279  0

At the end I should get :

Gps1    Gps2   start end
G1      GA     106   205  103
G1      GA     102   203  97
G1      GA     106   203  113
G1      GA     106   203  113
G2      GB       9    51  3
G2      GB      48   135  4
G2      GB     131   207  0
G2      GB     207   279  0
G3      GC     330   419  153
G3      GC     266   315  58
G3      GC     257   315  49
G3      GC     266   407  49
G4      GC     10    30   -30
G4      GC     60    90   -30

Does someone have an idea please ?

Here is the dict format of the dataframe if it can helps :

{'Gps1': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G3', 9: 'G3', 10: 'G3', 11: 'G3', 12: 'G4', 13: 'G4'}, 'Gps2': {0: 'GA', 1: 'GA', 2: 'GA', 3: 'GA', 4: 'GB', 5: 'GB', 6: 'GB', 7: 'GB', 8: 'GC', 9: 'GC', 10: 'GC', 11: 'GC', 12: 'GC', 13: 'GC'}, 'start': {0: 106, 1: 102, 2: 106, 3: 106, 4: 9, 5: 48, 6: 131, 7: 207, 8: 330, 9: 266, 10: 257, 11: 266, 12: 10, 13: 60}, 'end': {0: 205, 1: 203, 2: 203, 3: 203, 4: 51, 5: 135, 6: 207, 7: 279, 8: 419, 9: 315, 10: 315, 11: 407, 12: 30, 13: 90}}

CodePudding user response：

You can use a groupby.shift and ffill:

df['Nb_overlapping'] = (df['end'].sub(df.groupby(['Gps1', 'Gps2'])['start'].shift(-1))
                        .ffill(downcast='infer')
                       )

Output:

   Gps1 Gps2  start  end  Nb_overlapping
0    G1   GA    106  205             103
1    G1   GA    102  203              97
2    G1   GA    106  203              97
3    G1   GA    106  203              97
4    G2   GB      9   51               3
5    G2   GB     48  135               4
6    G2   GB    131  207               0
7    G2   GB    207  279               0
8    G3   GC    330  419             153
9    G3   GC    266  315              58
10   G3   GC    257  315              49
11   G3   GC    266  407              49
12   G4   GC     10   30             -30
13   G4   GC     60   90             -30