I have a dataset with multiple rows per ID like this
ID From To State
1 2004 2005 MD
1 2005 2005 MD
1 2005 2012 DC
1 2012 2015 DC
1 2015 2020 DC
1 2012 2013 MD
1 2013 2016 MD
1 2016 2019 MD
1 2019 2020 MD
2 2003 2004 OR
2 2004 2008 OR
2 2008 2013 AZ
2 2013 2015 AZ
My goal is to collapse the multiple From and To columns to create a smooth timeline like
ID From To State
1 2004 2005 MD
1 2005 2020 DC
1 2012 2020 MD
2 2003 2008 OR
2 2008 2015 AZ
Not sure how to accomplish this. An help is much appreciated. Thanks.
CodePudding user response:
Group by 'ID', 'State' and the run-length-id of 'State', get the first
of 'From' and last
of 'To'
library(dplyr)
library(data.table)
df1 %>%
group_by(ID, State, grp = rleid(State)) %>%
summarise(From = first(From), To = last(To), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 5 × 4
ID State From To
<int> <chr> <int> <int>
1 1 DC 2005 2020
2 1 MD 2004 2005
3 1 MD 2012 2020
4 2 AZ 2008 2015
5 2 OR 2003 2008
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), From = c(2004L, 2005L, 2005L, 2012L, 2015L, 2012L,
2013L, 2016L, 2019L, 2003L, 2004L, 2008L, 2013L), To = c(2005L,
2005L, 2012L, 2015L, 2020L, 2013L, 2016L, 2019L, 2020L, 2004L,
2008L, 2013L, 2015L), State = c("MD", "MD", "DC", "DC", "DC",
"MD", "MD", "MD", "MD", "OR", "OR", "AZ", "AZ")),
class = "data.frame", row.names = c(NA,
-13L))