Home > database >  r collapse by year by ID
r collapse by year by ID

Time:09-30

I have a dataset with multiple rows per ID like this

ID     From    To     State
1      2004    2005   MD
1      2005    2005   MD
1      2005    2012   DC
1      2012    2015   DC
1      2015    2020   DC
1      2012    2013   MD
1      2013    2016   MD
1      2016    2019   MD
1      2019    2020   MD
2      2003    2004   OR
2      2004    2008   OR
2      2008    2013   AZ
2      2013    2015   AZ

My goal is to collapse the multiple From and To columns to create a smooth timeline like

ID     From    To     State
1      2004    2005   MD
1      2005    2020   DC
1      2012    2020   MD
2      2003    2008   OR
2      2008    2015   AZ

Not sure how to accomplish this. An help is much appreciated. Thanks.

CodePudding user response:

Group by 'ID', 'State' and the run-length-id of 'State', get the first of 'From' and last of 'To'

library(dplyr)
library(data.table)
df1 %>% 
    group_by(ID, State, grp = rleid(State)) %>% 
    summarise(From = first(From), To = last(To), .groups = 'drop') %>% 
    select(-grp)

-output

# A tibble: 5 × 4
     ID State  From    To
  <int> <chr> <int> <int>
1     1 DC     2005  2020
2     1 MD     2004  2005
3     1 MD     2012  2020
4     2 AZ     2008  2015
5     2 OR     2003  2008

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L), From = c(2004L, 2005L, 2005L, 2012L, 2015L, 2012L, 
2013L, 2016L, 2019L, 2003L, 2004L, 2008L, 2013L), To = c(2005L, 
2005L, 2012L, 2015L, 2020L, 2013L, 2016L, 2019L, 2020L, 2004L, 
2008L, 2013L, 2015L), State = c("MD", "MD", "DC", "DC", "DC", 
"MD", "MD", "MD", "MD", "OR", "OR", "AZ", "AZ")), 
class = "data.frame", row.names = c(NA, 
-13L))
  • Related