I have an R data.frame of linear intervals:
df <- data.frame(id = paste0("i",1:15),
start = c(6575,7156,7949,45835,46347,47168,126804,127276,128127,157597,158074,158902,199129,199704,200507),
end = c(6928,7392,8260,46104,46610,47485,127079,127542,128417,157872,158340,159219,199374,199951,200938))
I also have an inter-interval distance cutoff:
inter.interval.distance.cutoff <- 3243
df is sorted by start and end. The first interval is assigned to group g1. From there on, each interval is compared with the interval preceding it, where the distance between them is defined as the start of the current interval minus the end of the preceding interval. If that distance is less than or equal to inter.interval.distance.cutoff, the interval is assigned to the group of the preceding interval; otherwise it starts a new group (the group index is incremented by 1, which is how we get a new group label).
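To make the distance definition concrete, here is a minimal sketch that computes the gap between each interval and the one preceding it for the example data. The names gaps and new.group are only illustrative and are not part of the original data.

```r
df <- data.frame(id = paste0("i", 1:15),
                 start = c(6575,7156,7949,45835,46347,47168,126804,127276,128127,157597,158074,158902,199129,199704,200507),
                 end = c(6928,7392,8260,46104,46610,47485,127079,127542,128417,157872,158340,159219,199374,199951,200938))
inter.interval.distance.cutoff <- 3243

# Gap = start of the current interval minus end of the preceding one.
# The first interval has no predecessor, so its gap is NA.
gaps <- c(NA, df$start[-1] - df$end[-nrow(df)])

# TRUE where an interval would open a new group under the rule described above.
new.group <- !is.na(gaps) & gaps > inter.interval.distance.cutoff

# Row indices at which a new group starts
which(new.group)
```

For this data the gaps inside a group are a few hundred units, while the gaps at rows 4, 7, 10 and 13 are in the tens of thousands, so those rows are where new groups begin.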
Here's my desired outcome:
df$group <- c(rep("g1",3), rep("g2",3), rep("g3",3), rep("g4",3), rep("g5",3))
Is there a fast way to do this?
CodePudding user response:
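# Distance is start of the current interval minus end of the preceding one, as defined in the question
df$group <- paste0('g', cumsum(c(1, df$start[-1] - df$end[-nrow(df)] > inter.interval.distance.cutoff)))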
    id  start    end group
1   i1   6575   6928    g1
2   i2   7156   7392    g1
3   i3   7949   8260    g1
4   i4  45835  46104    g2
5   i5  46347  46610    g2
6   i6  47168  47485    g2
7   i7 126804 127079    g3
8   i8 127276 127542    g3
9   i9 128127 128417    g3
10 i10 157597 157872    g4
11 i11 158074 158340    g4
12 i12 158902 159219    g4
13 i13 199129 199374    g5
14 i14 199704 199951    g5
15 i15 200507 200938    g5
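Since the question defines the distance as the current start minus the preceding end, a start-to-start difference such as diff(start) can give a different grouping when intervals are long. Below is a hedged sketch of a small helper that follows the question's definition and checks the result against the desired outcome. The function name assign_groups and the two-interval example at the end are made up for illustration, and the helper assumes the input is already sorted by start and end.

```r
# Hypothetical helper: assumes start/end are already sorted, as in the question.
assign_groups <- function(start, end, cutoff) {
  n <- length(start)
  if (n == 0) return(character(0))
  # Gap between each interval and its predecessor: current start - previous end.
  gaps <- start[-1] - end[-n]
  # The first interval always opens g1; each gap above the cutoff opens a new group.
  paste0("g", cumsum(c(TRUE, gaps > cutoff)))
}

df <- data.frame(id = paste0("i", 1:15),
                 start = c(6575,7156,7949,45835,46347,47168,126804,127276,128127,157597,158074,158902,199129,199704,200507),
                 end = c(6928,7392,8260,46104,46610,47485,127079,127542,128417,157872,158340,159219,199374,199951,200938))

groups <- assign_groups(df$start, df$end, 3243)
desired <- c(rep("g1",3), rep("g2",3), rep("g3",3), rep("g4",3), rep("g5",3))
identical(groups, desired)

# Made-up example: a long first interval, so the starts are far apart
# while the gap (5000 - 4000 = 1000) is below the cutoff.
long.start <- c(0, 5000)
long.end <- c(4000, 5500)
by.gap <- assign_groups(long.start, long.end, 3243)
by.start <- paste0("g", cumsum(c(1, diff(long.start) > 3243)))
```

In the made-up example, by.gap keeps both intervals in g1, whereas by.start splits them into g1 and g2. On the data from the question both approaches agree, because every interval there is much shorter than the cutoff.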