Assume a toy data.table of two integer columns x
aand y
. The x
coumn could contain any integer but y column will have a value one larger (next consecutive integer).
data.table::data.table(
x = c(1L, 3L, 7L, 8L, 11L, 12L, 13L, 15L),
y = c(2L, 4L, 8L, 9L, 12L, 13L, 14L, 16L)
)
#> x y
#> 1: 1 2
#> 2: 3 4
#> 3: 7 8
#> 4: 8 9
#> 5: 11 12
#> 6: 12 13
#> 7: 13 14
#> 8: 15 16
I want to collapse the rows that have consecutive running integer ranges (having one number common between rows) such that in the new data.table x
has the minimum and y
the max of the range. The only condition is that the range of integers (x:y
) will include all numbers that were present in the original rows that are replaced.
Here is the desired output:
data.table::data.table(
x = c(1L, 3L, 7L, 11L, 15L),
y = c(2L, 4L, 9L, 14L, 16L)
)
#> x y
#> 1: 1 2
#> 2: 3 4
#> 3: 7 9
#> 4: 11 14
#> 5: 15 16
Created on 2022-04-28 by the reprex package (v2.0.1)
This is such a simple requirement but I am unable to get my head around to solve it. There may be many ways to solve this but I am looking for an elegant way.
I am open to any solution using either data.table
or dpylr
.
CodePudding user response:
Create a grouping column and then summarise to return the first
of 'x' column and last
of 'y' column - grouping is created by checking whether the lag
values of 'y' are not equal to the current values of 'x', get the cumulative sum
library(data.table)
dt[, .(x = first(x), y = last(y)), .(grp = cumsum(shift(y,
fill = first(y)) != x))][, grp := NULL][]
-output
x y
<int> <int>
1: 1 2
2: 3 4
3: 7 9
4: 11 14
5: 15 16