I want to collapse microdata with multiple observations at different times per ID. Usually an ID has the same birth country but sometimes this changes. I want to collapse my data to one observation per ID and choose the country in a way that never chooses two specific countries (e.g. Canada and Germany). E.g. if there is one observation with birth country Canada and one with US, I want to choose the US. If a person has Italy and Germany I want to choose for example Italy. If there is only one country this should be kept.
My data:
ID birth_country
1 US
1 Canada
1 Canada
1 Canada
2 Germany
2 Italy
2 Germany
3 Canada
3 Canada
3 Canada
Should look like:
ID birth_country
1 US
2 Italy
3 Canada
I tried it with dplyer and group_by but couldn't figure out a proper way to formulate the selection by character.
CodePudding user response:
df %>%
group_by(ID) %>%
summarize(birth_country = first(birth_country))
should do the trick. Instead of first
you can use last
or nth
(see ?summarise
)
CodePudding user response:
With a preference
vector that orders the country by preference:
df <- data.frame(ID = rep.int(1:3, c(4, 3, 3)), birth_country = c("US", "Canada", "Italy", "Germany")[c(1,2,2,2,4,3,4,2,2,2)])
preference <- c("US", "Italy", "Canada", "Germany")
library(dplyr)
df %>%
group_by(ID) %>%
summarize(birth_country = preference[min(match(birth_country, preference))])
#> # A tibble: 3 x 2
#> ID birth_country
#> <int> <chr>
#> 1 1 US
#> 2 2 Italy
#> 3 3 Canada
Or data.table
library(data.table)
setDT(df)[, .(birth_country = preference[min(match(birth_country, preference))]), by = "ID"]
#> ID birth_country
#> 1: 1 US
#> 2: 2 Italy
#> 3: 3 Canada