Grouping linear intervals by distance cutoff

Time:06-04

I have an R data.frame of linear intervals:

df <- data.frame(id = paste0("i",1:15),
                 start = c(6575,7156,7949,45835,46347,47168,126804,127276,128127,157597,158074,158902,199129,199704,200507),
                 end = c(6928,7392,8260,46104,46610,47485,127079,127542,128417,157872,158340,159219,199374,199951,200938))

I also have an inter-interval distance cutoff:

inter.interval.distance.cutoff <- 3243

df is sorted by start and end. The first interval belongs to group g1. From there on, the distance between an interval and the one preceding it is defined as the start of the current interval minus the end of the preceding interval. If that distance is less than or equal to inter.interval.distance.cutoff, the interval is assigned to the group of the preceding interval; otherwise it starts a new group (the group index is incremented by 1, which is how we get a new group label).

Here's my desired outcome:

df$group <- c(rep("g1",3), rep("g2",3), rep("g3",3), rep("g4",3), rep("g5",3))

Is there a fast way to do this?

CodePudding user response:

# distance = start of the current interval minus end of the preceding one
df$group <- paste0('g', cumsum(c(1, df$start[-1] - df$end[-nrow(df)] > inter.interval.distance.cutoff)))

    id  start    end group
1   i1   6575   6928    g1
2   i2   7156   7392    g1
3   i3   7949   8260    g1
4   i4  45835  46104    g2
5   i5  46347  46610    g2
6   i6  47168  47485    g2
7   i7 126804 127079    g3
8   i8 127276 127542    g3
9   i9 128127 128417    g3
10 i10 157597 157872    g4
11 i11 158074 158340    g4
12 i12 158902 159219    g4
13 i13 199129 199374    g5
14 i14 199704 199951    g5
15 i15 200507 200938    g5
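
For reuse, the same idea can be wrapped in a small function. The sketch below is only an illustration: group_intervals is a hypothetical name that does not appear in the question or the answer, and it assumes the intervals are already sorted by start. It measures the gap as the question defines it (start of the current interval minus end of the preceding one) and keeps an interval in the current group when the gap is less than or equal to the cutoff.

```r
# Hypothetical helper (not part of the original answer).
# Assumes `start` and `end` are numeric vectors already sorted by start.
group_intervals <- function(start, end, cutoff) {
  n <- length(start)
  if (n == 0) return(character(0))
  # Gap to the preceding interval: start[i] - end[i - 1], for i = 2..n
  gaps <- start[-1] - end[-n]
  # The first interval opens g1; any gap larger than the cutoff opens a new group
  paste0("g", cumsum(c(TRUE, gaps > cutoff)))
}

df <- data.frame(id = paste0("i", 1:15),
                 start = c(6575, 7156, 7949, 45835, 46347, 47168, 126804, 127276,
                           128127, 157597, 158074, 158902, 199129, 199704, 200507),
                 end = c(6928, 7392, 8260, 46104, 46610, 47485, 127079, 127542,
                         128417, 157872, 158340, 159219, 199374, 199951, 200938))

df$group <- group_intervals(df$start, df$end, cutoff = 3243)
```

Everything here is vectorised, so no explicit loop over rows is needed; the single-row case falls out naturally because the gap vector is then empty.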