Home > Mobile >  summarising a 2 column data.table to continuous integers
summarising a 2 column data.table to continuous integers

Time:04-29

Assume a toy data.table of two integer columns x aand y. The x coumn could contain any integer but y column will have a value one larger (next consecutive integer).

data.table::data.table(
           x = c(1L, 3L, 7L, 8L, 11L, 12L, 13L, 15L),
           y = c(2L, 4L, 8L, 9L, 12L, 13L, 14L, 16L)
)
#>     x  y
#> 1:  1  2
#> 2:  3  4
#> 3:  7  8
#> 4:  8  9
#> 5: 11 12
#> 6: 12 13
#> 7: 13 14
#> 8: 15 16

I want to collapse the rows that have consecutive running integer ranges (having one number common between rows) such that in the new data.table x has the minimum and y the max of the range. The only condition is that the range of integers (x:y) will include all numbers that were present in the original rows that are replaced.

Here is the desired output:

data.table::data.table(
           x = c(1L, 3L, 7L, 11L, 15L),
           y = c(2L, 4L, 9L,  14L, 16L)
)
#>     x  y
#> 1:  1  2
#> 2:  3  4
#> 3:  7  9
#> 4: 11 14
#> 5: 15 16

Created on 2022-04-28 by the reprex package (v2.0.1)

This is such a simple requirement but I am unable to get my head around to solve it. There may be many ways to solve this but I am looking for an elegant way.

I am open to any solution using either data.table or dpylr.

CodePudding user response:

Create a grouping column and then summarise to return the first of 'x' column and last of 'y' column - grouping is created by checking whether the lag values of 'y' are not equal to the current values of 'x', get the cumulative sum

library(data.table)
dt[, .(x = first(x), y = last(y)), .(grp = cumsum(shift(y, 
     fill = first(y)) != x))][, grp := NULL][]

-output

      x     y
   <int> <int>
1:     1     2
2:     3     4
3:     7     9
4:    11    14
5:    15    16
  • Related