I have a dataframe such as:
Groups Scaff start end
G1 Scaff1 2278 4437
G1 Scaff1 2788 3048
G1 Scaff1 3959 4183
G1 Scaff1 4201 4407
G1 Scaff2 4553 5000
G1 Scaff2 6321 7700
G1 Scaff3 2870 5083
G1 Scaff4 1923 2042
G1 Scaff5 663 2885
G1 Scaff5 2145 2825
And I would like to add groups for each Grous-Scaff
overlapping coordinates.
Lets first take the G1-Scaff1
as an example:
Groups Scaff start end
G1 Scaff1 2278 4437
G1 Scaff1 2788 3048
G1 Scaff1 3959 4183
G1 Scaff1 4201 4407
as you can see all coordinates overlap with each other:
2278 - 4437
overlaps with2788 - 3048
2788 - 3048
overlaps with3959 - 4183
3959 - 4183
overlaps with4201 - 4407
so I group them all within the same Groups1:
Groups Scaff start end New_group
G1 Scaff1 2278 4437 G1
G1 Scaff1 2788 3048 G1
G1 Scaff1 3959 4183 G1
G1 Scaff1 4201 4407 G1
When I say overlap I mean in that way for instance if we compare 1-10 and 3-7 it would give an overlap of 4.
For the other example in G1 - Scaff2
: there is no overlap, then I put them in two different Groups
Groups Scaff start end New_group
G1 Scaff2 4553 5000 G2
G1 Scaff2 6321 7700 G3
I should then get overall:
Groups Scaff start end New_group
G1 Scaff1 2278 4437 G1
G1 Scaff1 2788 3048 G1
G1 Scaff1 3959 4183 G1
G1 Scaff1 4201 4407 G1
G1 Scaff2 4553 5000 G2
G1 Scaff2 6321 7700 G3
G1 Scaff3 2870 5083 G4
G1 Scaff4 1923 2042 G5
G1 Scaff5 663 2885 G6
G1 Scaff5 2145 2825 G6
So far I tried the following code:
is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)
tab['New_group'] = tab.sort_values(['Groups','Scaff','start','end']).groupby(['Groups','Scaff'],as_index=False).apply(is_overlapped).droplevel(0).cumsum()
Which gives:
Groups Scaff start end New_group
0 G1 Scaff1 2278 4437.0 1
1 G1 Scaff1 2788 3048.0 1
2 G1 Scaff1 3959 4183.0 2
3 G1 Scaff1 4201 4407.0 3
4 G1 Scaff2 4553 5000.0 4
5 G1 Scaff2 6321 7700.0 5
6 G1 Scaff3 2870 5083.0 6
7 G1 Scaff4 1923 2042 7
8 G1 Scaff5 663 2885 9
9 G1 Scaff5 2145 2825.0 8
and as you can see, rows 0,1,2 and 3 should all be in the same New_group
...
Here is the dataframe in dict format if it can helps :
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G1', 6: 'G1', 7: 'G1', 8: 'G1', 9: 'G1'}, 'Scaff': {0: 'Scaff1', 1: 'Scaff1', 2: 'Scaff1', 3: 'Scaff1', 4: 'Scaff2', 5: 'Scaff2', 6: 'Scaff3', 7: 'Scaff4', 8: 'Scaff5', 9: 'Scaff5'}, 'start': {0: 2278, 1: 2788, 2: 3959, 3: 4201, 4: 4553, 5: 6321, 6: 2870, 7: 1923, 8: 663, 9: 2145}, 'end': {0: 4437, 1: 3048, 2: 4183, 3: 4407, 4: 5000, 5: 7700, 6: 5083, 7: 2042, 8: 2885, 9: 2825}}
CodePudding user response:
Compare to the shifted cummax
"end" per group:
g = df.groupby(['Groups', 'Scaff'], group_keys=False)
group = df['start'].gt(g['end'].apply(lambda s: s.shift().cummax())).cumsum()
df['New_group'] = (df.groupby(['Groups', 'Scaff', group])
.ngroup().add(1).astype(str)
.radd('G')
)
Output:
Groups Scaff start end New_group
0 G1 Scaff1 2278 4437 G1
1 G1 Scaff1 2788 3048 G1
2 G1 Scaff1 3959 4183 G1
3 G1 Scaff1 4201 4407 G1
4 G1 Scaff2 4553 5000 G2
5 G1 Scaff2 6321 7700 G3
6 G1 Scaff3 2870 5083 G4
7 G1 Scaff4 1923 2042 G5
8 G1 Scaff5 663 2885 G6
9 G1 Scaff5 2145 2825 G6