Is there any way to cluster or sort a file by two numeric range columns?

I have a large file and I am looking for a way to sort or cluster the data by two numeric columns that together define a range (start and end), but I could not find a function that fits my problem. Could someone who knows how to do this please help me?
Thanks in advance.

My file looks like the sample below, only much larger. As the example shows, the first and second rows are contiguous: the second row starts right after the first row ends, with no gap in between. The same holds for the third and fourth rows. The fifth and sixth rows, however, are far apart from each other and from the other rows. I therefore want to treat rows 1 and 2 as one cluster, rows 3 and 4 as one cluster, and rows 5 and 6 as two separate clusters, so that I end up with 4 rows instead of 6.
Example file:

library(data.table)  # needed for setDT()
df <- setDT(data.frame(name = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
  start = c(8480001, 8480251, 10006251, 10006501, 13910501, 14841751),
  end = c(8480250, 8480500, 10006500, 10006750, 13910750, 14842000),
  length = c(250, 250, 250, 250, 250, 250)))

Expected output:

output <- setDT(data.frame(name = c("chr1", "chr1", "chr1", "chr1"),
  start = c(8480001, 10006251, 13910501, 14841751), 
  end = c(8480250, 10006500, 13910750, 14842000), 
  length = c(250, 250, 250, 250)))

In the output, I only want to keep the first row of each cluster, for example only row 1 for the cluster made of rows 1 and 2.

Thanks again.

CodePudding user response:

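We could create a grouping variable from the difference between each 'start' and the lag of 'end' (the previous row's 'end'). A difference greater than 1 means there is a gap, so a new cluster begins at that row; the cumulative sum of that condition then gives a cluster id, and .I[1] returns the index of the first row of each cluster. This assumes the rows are already sorted by 'start'.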

library(data.table)
df[df[, .I[1], cumsum(start - shift(end, fill = first(end)) > 1)]$V1]

Output:

     name    start      end length
   <char>    <num>    <num>  <num>
1:   chr1  8480001  8480250    250
2:   chr1 10006251 10006500    250
3:   chr1 13910501 13910750    250
4:   chr1 14841751 14842000    250
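
The one-liner above can be unpacked into separate steps, which may make the logic easier to follow and adapt. The following is only a sketch under a few assumptions that are not in the original post: the column names `new_cluster` and `cluster` are made up for illustration, the data is sorted first, and the grouping is done per `name` on the guess that a real file contains more than one chromosome. The `merged` table at the end is an optional variation (not what the question asked for) that collapses each cluster into a single spanning range.

```r
library(data.table)

# Sample data from the question
df <- data.table(
  name   = rep("chr1", 6),
  start  = c(8480001, 8480251, 10006251, 10006501, 13910501, 14841751),
  end    = c(8480250, 8480500, 10006500, 10006750, 13910750, 14842000),
  length = rep(250, 6)
)

# Sort so that the previous row is always the nearest preceding range
setorder(df, name, start)

# A row opens a new cluster when it starts more than 1 past the previous
# row's end. Done per `name` so ranges on different chromosomes never join.
df[, new_cluster := start - shift(end, fill = end[1L]) > 1, by = name]

# Running count of cluster openings = cluster id (here: 0 0 1 1 2 3)
df[, cluster := cumsum(new_cluster), by = name]

# First row of each cluster, as requested in the question
first_rows <- df[, .SD[1L], by = .(name, cluster)]

# Optional variation (assumption): one row spanning each whole cluster
merged <- df[, .(start = min(start), end = max(end), n_rows = .N),
             by = .(name, cluster)]
```

Keeping the cluster id as a real column makes it possible to inspect which rows were joined before dropping any of them. If the threshold for "no gap" should be looser, the `> 1` comparison is the only place that needs to change.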