I have a dataset with the columns start
, end
and type
:
structure(list(number = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start = structure(c(1121472000,
1184544000, 1196467200, 1186963200, 1199145600, 1228089600, 1232582400,
1233446400, 1246406400, 1262304000, 1271030400, 1285891200, 1293840000,
1344211200, 1367366400, 1413158400, 1427846400, 1441065600, 1451952000,
1461542400, 1464739200, 1491004800, 1501545600, 1506816000, 1509494400,
1548979200, 1557705600, 1561939200, 1572825600, 1590969600, 1606780800
), tzone = "UTC", class = c("POSIXct", "POSIXt")), end = structure(c(1184457600,
1196380800, 253402214400, 1199059200, 1228003200, 1232496000,
1233360000, 1246320000, 1262217600, 1270944000, 1285804800, 1293753600,
1344124800, 1367280000, 1413072000, 1427760000, 1440979200, 1451865600,
1461456000, 1464652800, 1490918400, 1501459200, 1506729600, 1509408000,
1548892800, 1557619200, 1561852800, 1572739200, 1590883200, 1606694400,
253402214400), tzone = "UTC", class = c("POSIXct", "POSIXt")),
type = c("A", "A", "A", "B", "B", "B", "C", "D", "D", "D",
"D", "D", "D", "D", "E", "D", "E", "E", "F", "E", "E", "E",
"E", "E", "E", "E", "E", "E", "E", "E", "E")), row.names = c(NA,
-31L), class = c("tbl_df", "tbl", "data.frame"))
tail(data, 20)
number start end type
<dbl> <dttm> <dttm> <chr>
1 1 2010-10-01 00:00:00 2010-12-31 00:00:00 D
2 1 2011-01-01 00:00:00 2012-08-05 00:00:00 D
3 1 2012-08-06 00:00:00 2013-04-30 00:00:00 D
4 1 2013-05-01 00:00:00 2014-10-12 00:00:00 E
5 1 2014-10-13 00:00:00 2015-03-31 00:00:00 D
6 1 2015-04-01 00:00:00 2015-08-31 00:00:00 E
7 1 2015-09-01 00:00:00 2016-01-04 00:00:00 E
8 1 2016-01-05 00:00:00 2016-04-24 00:00:00 F
9 1 2016-04-25 00:00:00 2016-05-31 00:00:00 E
10 1 2016-06-01 00:00:00 2017-03-31 00:00:00 E
11 1 2017-04-01 00:00:00 2017-07-31 00:00:00 E
12 1 2017-08-01 00:00:00 2017-09-30 00:00:00 E
13 1 2017-10-01 00:00:00 2017-10-31 00:00:00 E
14 1 2017-11-01 00:00:00 2019-01-31 00:00:00 E
15 1 2019-02-01 00:00:00 2019-05-12 00:00:00 E
16 1 2019-05-13 00:00:00 2019-06-30 00:00:00 E
17 1 2019-07-01 00:00:00 2019-11-03 00:00:00 E
18 1 2019-11-04 00:00:00 2020-05-31 00:00:00 E
19 1 2020-06-01 00:00:00 2020-11-30 00:00:00 E
20 1 2020-12-01 00:00:00 9999-12-31 00:00:00 E
Question:
I want to summarise the data based on the group of column type but I want the groups to be only for adjacent rows of the same value.
The relevant column here is type
, where I want to group not only based on the value itslef e.g. "E", or "D", but also group based on the position (only group adjacend values of "D", "E", etc. together.
As of now I could obviously use:
> data_mre %>%
group_by(number, type) %>%
summarise(start = dplyr::first(start),
end = dplyr::last(end),
.groups = "drop")
# A tibble: 6 x 4
number type start end
<dbl> <chr> <dttm> <dttm>
1 1 A 2005-07-16 00:00:00 9999-12-31 00:00:00
2 1 B 2007-08-13 00:00:00 2009-01-21 00:00:00
3 1 C 2009-01-22 00:00:00 2009-01-31 00:00:00
4 1 D 2009-02-01 00:00:00 2015-03-31 00:00:00
5 1 E 2013-05-01 00:00:00 9999-12-31 00:00:00
6 1 F 2016-01-05 00:00:00 2016-04-24 00:00:00
But this would build groups that overlap the rows and dont take the position within the rows into account. Is there a way that I could also take the row as an argument in the group_by()
statement?
This is what I imagine the result should look like:
number type start end
<dbl> <chr> <dttm> <dttm>
1 1 A 2005-07-16 00:00:00 9999-12-31 00:00:00
2 1 B 2007-08-13 00:00:00 2009-01-21 00:00:00
3 1 C 2009-01-22 00:00:00 2009-01-31 00:00:00
4 1 D 2009-02-01 00:00:00 2013-04-30 00:00:00
5 1 E 2013-05-01 00:00:00 2014-10-12 00:00:00
6 1 D 2014-10-13 00:00:00 2015-03-31 00:00:00
5 1 E 2015-04-01 00:00:00 2016-01-04 00:00:00
6 1 F 2016-01-05 00:00:00 2016-04-24 00:00:00
5 1 E 2016-04-25 00:00:00 9999-12-31 00:00:00
CodePudding user response:
You can use data.table::rleid
Using tidy/dplyr:
data %>%
group_by(group = data.table::rleid(type)) %>%
summarize(number= first(number), type = first(type), start=first(start), end=last(end)) %>%
select(!group)
Ouput:
number type start end
<dbl> <chr> <dttm> <dttm>
1 1 A 2005-07-16 00:00:00 9999-12-31 00:00:00
2 1 B 2007-08-13 00:00:00 2009-01-21 00:00:00
3 1 C 2009-01-22 00:00:00 2009-01-31 00:00:00
4 1 D 2009-02-01 00:00:00 2013-04-30 00:00:00
5 1 E 2013-05-01 00:00:00 2014-10-12 00:00:00
6 1 D 2014-10-13 00:00:00 2015-03-31 00:00:00
7 1 E 2015-04-01 00:00:00 2016-01-04 00:00:00
8 1 F 2016-01-05 00:00:00 2016-04-24 00:00:00
9 1 E 2016-04-25 00:00:00 9999-12-31 00:00:00
Using data.table
library(data.table)
setDT(data)[, .(
number=first(number),
start = first(start),
end = last(end),
type = first(type)
),
by=.(rleid(type))][,.(number, type, start, end )]
Output
number type start end
<num> <char> <POSc> <POSc>
1: 1 A 2005-07-16 9999-12-31
2: 1 B 2007-08-13 2009-01-21
3: 1 C 2009-01-22 2009-01-31
4: 1 D 2009-02-01 2013-04-30
5: 1 E 2013-05-01 2014-10-12
6: 1 D 2014-10-13 2015-03-31
7: 1 E 2015-04-01 2016-01-04
8: 1 F 2016-01-05 2016-04-24
9: 1 E 2016-04-25 9999-12-31