Home > Blockchain >  dplyr: group_by() based on value AND Position in column
dplyr: group_by() based on value AND Position in column

Time:02-16

I have a dataset with the columns start, end and type:

structure(list(number = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start = structure(c(1121472000, 
1184544000, 1196467200, 1186963200, 1199145600, 1228089600, 1232582400, 
1233446400, 1246406400, 1262304000, 1271030400, 1285891200, 1293840000, 
1344211200, 1367366400, 1413158400, 1427846400, 1441065600, 1451952000, 
1461542400, 1464739200, 1491004800, 1501545600, 1506816000, 1509494400, 
1548979200, 1557705600, 1561939200, 1572825600, 1590969600, 1606780800
), tzone = "UTC", class = c("POSIXct", "POSIXt")), end = structure(c(1184457600, 
1196380800, 253402214400, 1199059200, 1228003200, 1232496000, 
1233360000, 1246320000, 1262217600, 1270944000, 1285804800, 1293753600, 
1344124800, 1367280000, 1413072000, 1427760000, 1440979200, 1451865600, 
1461456000, 1464652800, 1490918400, 1501459200, 1506729600, 1509408000, 
1548892800, 1557619200, 1561852800, 1572739200, 1590883200, 1606694400, 
253402214400), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    type = c("A", "A", "A", "B", "B", "B", "C", "D", "D", "D", 
    "D", "D", "D", "D", "E", "D", "E", "E", "F", "E", "E", "E", 
    "E", "E", "E", "E", "E", "E", "E", "E", "E")), row.names = c(NA, 
-31L), class = c("tbl_df", "tbl", "data.frame"))
tail(data, 20)

   number start               end                 type
    <dbl> <dttm>              <dttm>              <chr>   
 1      1 2010-10-01 00:00:00 2010-12-31 00:00:00 D       
 2      1 2011-01-01 00:00:00 2012-08-05 00:00:00 D       
 3      1 2012-08-06 00:00:00 2013-04-30 00:00:00 D       
 4      1 2013-05-01 00:00:00 2014-10-12 00:00:00 E       
 5      1 2014-10-13 00:00:00 2015-03-31 00:00:00 D       
 6      1 2015-04-01 00:00:00 2015-08-31 00:00:00 E       
 7      1 2015-09-01 00:00:00 2016-01-04 00:00:00 E       
 8      1 2016-01-05 00:00:00 2016-04-24 00:00:00 F       
 9      1 2016-04-25 00:00:00 2016-05-31 00:00:00 E       
10      1 2016-06-01 00:00:00 2017-03-31 00:00:00 E       
11      1 2017-04-01 00:00:00 2017-07-31 00:00:00 E       
12      1 2017-08-01 00:00:00 2017-09-30 00:00:00 E       
13      1 2017-10-01 00:00:00 2017-10-31 00:00:00 E       
14      1 2017-11-01 00:00:00 2019-01-31 00:00:00 E       
15      1 2019-02-01 00:00:00 2019-05-12 00:00:00 E       
16      1 2019-05-13 00:00:00 2019-06-30 00:00:00 E       
17      1 2019-07-01 00:00:00 2019-11-03 00:00:00 E       
18      1 2019-11-04 00:00:00 2020-05-31 00:00:00 E       
19      1 2020-06-01 00:00:00 2020-11-30 00:00:00 E       
20      1 2020-12-01 00:00:00 9999-12-31 00:00:00 E

Question:

I want to summarise the data based on the group of column type but I want the groups to be only for adjacent rows of the same value. The relevant column here is type, where I want to group not only based on the value itslef e.g. "E", or "D", but also group based on the position (only group adjacend values of "D", "E", etc. together.

As of now I could obviously use:

> data_mre %>%
    group_by(number, type) %>%
    summarise(start = dplyr::first(start),
              end = dplyr::last(end),
              .groups = "drop")
# A tibble: 6 x 4
  number type  start               end                
   <dbl> <chr> <dttm>              <dttm>             
1      1 A     2005-07-16 00:00:00 9999-12-31 00:00:00
2      1 B     2007-08-13 00:00:00 2009-01-21 00:00:00
3      1 C     2009-01-22 00:00:00 2009-01-31 00:00:00
4      1 D     2009-02-01 00:00:00 2015-03-31 00:00:00
5      1 E     2013-05-01 00:00:00 9999-12-31 00:00:00
6      1 F     2016-01-05 00:00:00 2016-04-24 00:00:00

But this would build groups that overlap the rows and dont take the position within the rows into account. Is there a way that I could also take the row as an argument in the group_by()statement?

This is what I imagine the result should look like:

  number type  start               end                
   <dbl> <chr> <dttm>              <dttm>             
1      1 A     2005-07-16 00:00:00 9999-12-31 00:00:00
2      1 B     2007-08-13 00:00:00 2009-01-21 00:00:00
3      1 C     2009-01-22 00:00:00 2009-01-31 00:00:00
4      1 D     2009-02-01 00:00:00 2013-04-30 00:00:00
5      1 E     2013-05-01 00:00:00 2014-10-12 00:00:00
6      1 D     2014-10-13 00:00:00 2015-03-31 00:00:00
5      1 E     2015-04-01 00:00:00 2016-01-04 00:00:00
6      1 F     2016-01-05 00:00:00 2016-04-24 00:00:00
5      1 E     2016-04-25 00:00:00 9999-12-31 00:00:00

CodePudding user response:

You can use data.table::rleid

Using tidy/dplyr:

data %>% 
  group_by(group = data.table::rleid(type)) %>%
  summarize(number= first(number), type = first(type), start=first(start), end=last(end)) %>% 
  select(!group)

Ouput:

  number type  start               end                
   <dbl> <chr> <dttm>              <dttm>             
1      1 A     2005-07-16 00:00:00 9999-12-31 00:00:00
2      1 B     2007-08-13 00:00:00 2009-01-21 00:00:00
3      1 C     2009-01-22 00:00:00 2009-01-31 00:00:00
4      1 D     2009-02-01 00:00:00 2013-04-30 00:00:00
5      1 E     2013-05-01 00:00:00 2014-10-12 00:00:00
6      1 D     2014-10-13 00:00:00 2015-03-31 00:00:00
7      1 E     2015-04-01 00:00:00 2016-01-04 00:00:00
8      1 F     2016-01-05 00:00:00 2016-04-24 00:00:00
9      1 E     2016-04-25 00:00:00 9999-12-31 00:00:00

Using data.table

library(data.table)

setDT(data)[, .(
  number=first(number),
  start = first(start), 
  end = last(end), 
  type = first(type)
  ), 
  by=.(rleid(type))][,.(number, type, start, end )]

Output

   number   type      start        end
    <num> <char>     <POSc>     <POSc>
1:      1      A 2005-07-16 9999-12-31
2:      1      B 2007-08-13 2009-01-21
3:      1      C 2009-01-22 2009-01-31
4:      1      D 2009-02-01 2013-04-30
5:      1      E 2013-05-01 2014-10-12
6:      1      D 2014-10-13 2015-03-31
7:      1      E 2015-04-01 2016-01-04
8:      1      F 2016-01-05 2016-04-24
9:      1      E 2016-04-25 9999-12-31
  • Related