I have a dataframe such as :
Gps1 Gps2 start end
G1 GA 106 205
G1 GA 102 203
G1 GA 106 203
G1 GA 106 203
G2 GB 9 51
G2 GB 48 135
G2 GB 131 207
G2 GB 207 279
G3 GC 330 419
G3 GC 266 315
G3 GC 257 315
G3 GC 266 407
G4 GC 10 30
G4 GC 60 90
and I would like for each ['Gps1','Gps2']
groups to calculate for each row the number of overlapping coordinates between the first row end
- the end of the second row
.
So here is a detailed example :
for the first row of G1-GA
:
205-102 = 103
so I put 0.98 in the first row:
Gps1 Gps2 start end Nb_overlapping
G1 GA 106 205 103
G1 GA 102 203
G1 GA 106 203
G1 GA 106 203
then the second row:
203-106= 97
, so I fill it and so on for the others :
Gps1 Gps2 start end Nb_overlapping
G1 GA 106 205 103
G1 GA 102 203 97
G1 GA 106 203 113
G1 GA 90 210
The last row have to have the same value as the before-last value :
Gps1 Gps2 start end Nb_overlapping
G1 GA 106 205 103
G1 GA 102 203 97
G1 GA 106 203 113
G1 GA 106 203 113
Then for the group G2-GB
I do the same :
Gps1 Gps2 start end Nb_overlapping
G2 GB 9 51 3
G2 GB 48 135 4
G2 GB 131 207 0
G2 GB 207 279 0
At the end I should get :
Gps1 Gps2 start end
G1 GA 106 205 103
G1 GA 102 203 97
G1 GA 106 203 113
G1 GA 106 203 113
G2 GB 9 51 3
G2 GB 48 135 4
G2 GB 131 207 0
G2 GB 207 279 0
G3 GC 330 419 153
G3 GC 266 315 58
G3 GC 257 315 49
G3 GC 266 407 49
G4 GC 10 30 -30
G4 GC 60 90 -30
Does someone have an idea please ?
Here is the dict format of the dataframe if it can helps :
{'Gps1': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G3', 9: 'G3', 10: 'G3', 11: 'G3', 12: 'G4', 13: 'G4'}, 'Gps2': {0: 'GA', 1: 'GA', 2: 'GA', 3: 'GA', 4: 'GB', 5: 'GB', 6: 'GB', 7: 'GB', 8: 'GC', 9: 'GC', 10: 'GC', 11: 'GC', 12: 'GC', 13: 'GC'}, 'start': {0: 106, 1: 102, 2: 106, 3: 106, 4: 9, 5: 48, 6: 131, 7: 207, 8: 330, 9: 266, 10: 257, 11: 266, 12: 10, 13: 60}, 'end': {0: 205, 1: 203, 2: 203, 3: 203, 4: 51, 5: 135, 6: 207, 7: 279, 8: 419, 9: 315, 10: 315, 11: 407, 12: 30, 13: 90}}
CodePudding user response:
You can use a groupby.shift
and ffill
:
df['Nb_overlapping'] = (df['end'].sub(df.groupby(['Gps1', 'Gps2'])['start'].shift(-1))
.ffill(downcast='infer')
)
Output:
Gps1 Gps2 start end Nb_overlapping
0 G1 GA 106 205 103
1 G1 GA 102 203 97
2 G1 GA 106 203 97
3 G1 GA 106 203 97
4 G2 GB 9 51 3
5 G2 GB 48 135 4
6 G2 GB 131 207 0
7 G2 GB 207 279 0
8 G3 GC 330 419 153
9 G3 GC 266 315 58
10 G3 GC 257 315 49
11 G3 GC 266 407 49
12 G4 GC 10 30 -30
13 G4 GC 60 90 -30