I have a big data.frame where I can find a municipality id for each person(id) over time. Sometimes they move and there will be a new municipality_id from the next day on. However, sometimes the municipality_id stays the same. I would like to collapse those date intervals by each id if there is no real change in municipality_id
This data
id municipality_id from to
1 A 820 2007-01-01 2007-02-28
2 A 200 2007-03-01 2100-01-01
3 B 820 2007-01-01 2007-03-31
4 B 820 2007-04-01 2007-05-31
5 B 830 2007-06-01 2008-01-31
6 B 830 2008-02-01 2100-01-01
7 C 700 2007-01-01 2007-05-31
8 C 500 2007-06-01 2008-12-31
9 C 700 2009-01-01 2100-01-01
should turn into this (two observations for B were extended)
id municipality_id from to
1 A 820 2007-01-01 2007-02-28
2 A 200 2007-03-01 2100-01-01
3 B 820 2007-01-01 2007-05-31
4 B 830 2007-06-01 2100-01-01
5 C 700 2007-01-01 2007-05-31
6 C 500 2007-06-01 2008-12-31
7 C 700 2009-01-01 2100-01-01
Here the code to create my two tables:
data <- data.frame(id = c('A', 'A', 'B', 'B','B', 'B', 'C', 'C', 'C'),
municipality_id = c(820, 200, 820, 820, 830, 830, 700, 500, 700),
from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-04-01", "2007-06-01", "2008-02-01", "2007-01-01", "2007-06-01", "2009-01-01")),
to = as.Date(c("2007-02-28", "2100-01-01", "2007-03-31", "2007-05-31", "2008-01-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01")))
Should turn into:
data_edit <- data.frame(id = c('A', 'A', 'B', 'B', 'C', 'C', 'C'),
municipality_id = c(820, 200, 820, 830, 700, 500, 700),
from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-06-01", "2007-01-01", "2007-06-01", "2009-01-01")),
to = as.Date(c("2007-02-28", "2100-01-01", "2007-05-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01")))
Is there an easy solution with R? Thank you for helping me :)
CodePudding user response:
How about this:
data <- data.frame(id = c('A', 'A', 'B', 'B','B', 'B', 'C', 'C', 'C'),
municipality_id = c(820, 200, 820, 820, 830, 830, 700, 500, 700),
from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-04-01", "2007-06-01", "2008-02-01", "2007-01-01", "2007-06-01", "2009-01-01")),
to = as.Date(c("2007-02-28", "2100-01-01", "2007-03-31", "2007-05-31", "2008-01-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01")))
library(tidyr)
library(dplyr)
data %>%
arrange(id, from) %>%
group_by(id) %>%
mutate(diff_mid = ifelse(municipality_id == lag(municipality_id), 0, 1),
diff_mid = ifelse(is.na(diff_mid), 0, diff_mid),
gp = cumsum(diff_mid)) %>%
pivot_longer(from:to, names_to="frto", values_to="vals") %>%
group_by(id, municipality_id, gp) %>%
summarise(from = min(vals), to=max(vals)) %>%
ungroup %>%
arrange(id, from) %>%
ungroup %>%
select(-gp)
#> `summarise()` has grouped output by 'id', 'municipality_id'. You can override
#> using the `.groups` argument.
#> # A tibble: 7 × 4
#> id municipality_id from to
#> <chr> <dbl> <date> <date>
#> 1 A 820 2007-01-01 2007-02-28
#> 2 A 200 2007-03-01 2100-01-01
#> 3 B 820 2007-01-01 2007-05-31
#> 4 B 830 2007-06-01 2100-01-01
#> 5 C 700 2007-01-01 2007-05-31
#> 6 C 500 2007-06-01 2008-12-31
#> 7 C 700 2009-01-01 2100-01-01
Created on 2022-12-06 by the reprex package (v2.0.1)