Home > other >  Group overlaping start-end coordinates within groups in pandas
Group overlaping start-end coordinates within groups in pandas

Time:12-14

I have a dataframe such as:

Groups Scaff  start    end
G1     Scaff1 2278     4437
G1     Scaff1 2788     3048
G1     Scaff1 3959     4183
G1     Scaff1 4201     4407
G1     Scaff2 4553     5000
G1     Scaff2 6321     7700
G1     Scaff3 2870     5083
G1     Scaff4 1923     2042
G1     Scaff5 663      2885
G1     Scaff5 2145     2825

And I would like to add groups for each Grous-Scaff overlapping coordinates.

Lets first take the G1-Scaff1 as an example:

Groups Scaff  start    end
G1     Scaff1 2278     4437
G1     Scaff1 2788     3048
G1     Scaff1 3959     4183
G1     Scaff1 4201     4407

as you can see all coordinates overlap with each other:

  • 2278 - 4437 overlaps with 2788 - 3048
  • 2788 - 3048 overlaps with 3959 - 4183
  • 3959 - 4183 overlaps with 4201 - 4407

so I group them all within the same Groups1:

Groups Scaff  start    end   New_group
G1     Scaff1 2278     4437  G1
G1     Scaff1 2788     3048  G1
G1     Scaff1 3959     4183  G1
G1     Scaff1 4201     4407  G1

When I say overlap I mean in that way for instance if we compare 1-10 and 3-7 it would give an overlap of 4.

For the other example in G1 - Scaff2: there is no overlap, then I put them in two different Groups

Groups Scaff  start    end      New_group
G1     Scaff2 4553     5000     G2
G1     Scaff2 6321     7700     G3

I should then get overall:

Groups Scaff  start    end   New_group
G1     Scaff1 2278     4437  G1
G1     Scaff1 2788     3048  G1
G1     Scaff1 3959     4183  G1
G1     Scaff1 4201     4407  G1
G1     Scaff2 4553     5000  G2
G1     Scaff2 6321     7700  G3
G1     Scaff3 2870     5083  G4
G1     Scaff4 1923     2042  G5
G1     Scaff5 663      2885  G6
G1     Scaff5 2145     2825  G6

So far I tried the following code:

is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)
tab['New_group'] = tab.sort_values(['Groups','Scaff','start','end']).groupby(['Groups','Scaff'],as_index=False).apply(is_overlapped).droplevel(0).cumsum()

Which gives:

  Groups       Scaff  start     end  New_group
0     G1      Scaff1   2278  4437.0          1
1     G1      Scaff1   2788  3048.0          1
2     G1      Scaff1   3959  4183.0          2
3     G1      Scaff1   4201  4407.0          3
4     G1      Scaff2   4553  5000.0          4
5     G1      Scaff2   6321  7700.0          5
6     G1      Scaff3   2870  5083.0          6
7     G1      Scaff4   1923  2042            7
8     G1      Scaff5   663   2885            9
9     G1      Scaff5   2145  2825.0          8

and as you can see, rows 0,1,2 and 3 should all be in the same New_group...

Here is the dataframe in dict format if it can helps :

{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G1', 6: 'G1', 7: 'G1', 8: 'G1', 9: 'G1'}, 'Scaff': {0: 'Scaff1', 1: 'Scaff1', 2: 'Scaff1', 3: 'Scaff1', 4: 'Scaff2', 5: 'Scaff2', 6: 'Scaff3', 7: 'Scaff4', 8: 'Scaff5', 9: 'Scaff5'}, 'start': {0: 2278, 1: 2788, 2: 3959, 3: 4201, 4: 4553, 5: 6321, 6: 2870, 7: 1923, 8: 663, 9: 2145}, 'end': {0: 4437, 1: 3048, 2: 4183, 3: 4407, 4: 5000, 5: 7700, 6: 5083, 7: 2042, 8: 2885, 9: 2825}}

CodePudding user response:

Compare to the shifted cummax "end" per group:

g = df.groupby(['Groups', 'Scaff'], group_keys=False)
group = df['start'].gt(g['end'].apply(lambda s: s.shift().cummax())).cumsum()

df['New_group'] = (df.groupby(['Groups', 'Scaff', group])
                     .ngroup().add(1).astype(str)
                     .radd('G')
                  )

Output:

  Groups   Scaff  start   end New_group
0     G1  Scaff1   2278  4437        G1
1     G1  Scaff1   2788  3048        G1
2     G1  Scaff1   3959  4183        G1
3     G1  Scaff1   4201  4407        G1
4     G1  Scaff2   4553  5000        G2
5     G1  Scaff2   6321  7700        G3
6     G1  Scaff3   2870  5083        G4
7     G1  Scaff4   1923  2042        G5
8     G1  Scaff5    663  2885        G6
9     G1  Scaff5   2145  2825        G6
  • Related