I have a big file and I try to find a way to do sorting or do clustering of the data according to two numeric columns that are in a range of numbers, but I could not find correct or fit function regarding my question. Could you please someone how knows help me.
Thanks in advance.
My file is like this sample file but very big and as you see in this example, first and second rows are like alternating numbers (I mean without any gap in between (sequence number)) and also third and forth rows are like that, but rows fifth and sixth are different and actually far from eachother. Therefore, I want to consider first and second as a one cluster, third and forth as a one cluster, fifth and sixth as a two different clusters to have at the end 4 rows instead of 6 rows because rows 1,2 and 3,4 are in one range without any gap in between.
Example file:
df <- setDT(data.frame(name = c("chr1", "chr1", "chr1", "chr1","chr1","chr1"),
start = c(8480001, 8480251, 10006251, 10006501,13910501,14841751),
end = c(8480250, 8480500, 10006500, 10006750,13910750,14842000),
length = c(250, 250, 250, 250,250,250)))
Expected output:
output <- setDT(data.frame(name = c("chr1", "chr1", "chr1", "chr1"),
start = c(8480001, 10006251, 13910501, 14841751),
end = c(8480250, 10006500, 13910750, 14842000),
length = c(250, 250, 250, 250)))
In output, just I want to have first row of those rows that are in one cluster for example just row 1 for 1 and 2.
Thanks again.
CodePudding user response:
We could create a group based on the difference between the 'start' and the lag
of 'end' and take the first row
library(data.table)
df[df[, .I[1], cumsum(start - shift(end, fill = first(end)) > 1)]$V1]
-output
name start end length
<char> <num> <num> <num>
1: chr1 8480001 8480250 250
2: chr1 10006251 10006500 250
3: chr1 13910501 13910750 250
4: chr1 14841751 14842000 250