Extend date intervals by two different IDs in R-CodePudding

I have a big data.frame where I can find a municipality id for each person(id) over time. Sometimes they move and there will be a new municipality_id from the next day on. However, sometimes the municipality_id stays the same. I would like to collapse those date intervals by each id if there is no real change in municipality_id

This data


 id municipality_id       from         to
1  A             820 2007-01-01 2007-02-28
2  A             200 2007-03-01 2100-01-01
3  B             820 2007-01-01 2007-03-31
4  B             820 2007-04-01 2007-05-31
5  B             830 2007-06-01 2008-01-31
6  B             830 2008-02-01 2100-01-01
7  C             700 2007-01-01 2007-05-31
8  C             500 2007-06-01 2008-12-31
9  C             700 2009-01-01 2100-01-01

should turn into this (two observations for B were extended)


 id municipality_id       from         to
1  A             820 2007-01-01 2007-02-28
2  A             200 2007-03-01 2100-01-01
3  B             820 2007-01-01 2007-05-31
4  B             830 2007-06-01 2100-01-01
5  C             700 2007-01-01 2007-05-31
6  C             500 2007-06-01 2008-12-31
7  C             700 2009-01-01 2100-01-01

Here the code to create my two tables:

data <- data.frame(id = c('A', 'A', 'B', 'B','B', 'B', 'C', 'C', 'C'),
                   municipality_id = c(820, 200, 820, 820, 830, 830, 700, 500, 700),
                   from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-04-01", "2007-06-01", "2008-02-01", "2007-01-01", "2007-06-01", "2009-01-01")),
                   to = as.Date(c("2007-02-28", "2100-01-01", "2007-03-31", "2007-05-31", "2008-01-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01")))

Should turn into:

data_edit <- data.frame(id = c('A', 'A', 'B', 'B', 'C', 'C', 'C'),
                   municipality_id = c(820, 200, 820, 830, 700, 500, 700),
                   from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-06-01", "2007-01-01", "2007-06-01", "2009-01-01")),
                   to = as.Date(c("2007-02-28", "2100-01-01", "2007-05-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01")))

Is there an easy solution with R? Thank you for helping me :)

CodePudding user response：

How about this:

data <- data.frame(id = c('A', 'A', 'B', 'B','B', 'B', 'C', 'C', 'C'),
                   municipality_id = c(820, 200, 820, 820, 830, 830, 700, 500, 700),
                   from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-04-01", "2007-06-01", "2008-02-01", "2007-01-01", "2007-06-01", "2009-01-01")),
                   to = as.Date(c("2007-02-28", "2100-01-01", "2007-03-31", "2007-05-31", "2008-01-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01"))) 

library(tidyr)
library(dplyr)
data %>% 
  arrange(id, from) %>% 
  group_by(id) %>% 
  mutate(diff_mid = ifelse(municipality_id == lag(municipality_id), 0, 1), 
         diff_mid = ifelse(is.na(diff_mid), 0, diff_mid), 
         gp = cumsum(diff_mid)) %>%
  pivot_longer(from:to, names_to="frto", values_to="vals") %>% 
  group_by(id, municipality_id, gp) %>% 
  summarise(from = min(vals), to=max(vals)) %>% 
  ungroup %>% 
  arrange(id, from) %>% 
  ungroup %>% 
  select(-gp)
#> `summarise()` has grouped output by 'id', 'municipality_id'. You can override
#> using the `.groups` argument.
#> # A tibble: 7 × 4
#>   id    municipality_id from       to        
#>   <chr>           <dbl> <date>     <date>    
#> 1 A                 820 2007-01-01 2007-02-28
#> 2 A                 200 2007-03-01 2100-01-01
#> 3 B                 820 2007-01-01 2007-05-31
#> 4 B                 830 2007-06-01 2100-01-01
#> 5 C                 700 2007-01-01 2007-05-31
#> 6 C                 500 2007-06-01 2008-12-31
#> 7 C                 700 2009-01-01 2100-01-01

^{Created on 2022-12-06 by the reprex package (v2.0.1)}