Home > Back-end >  R: collapse by ID selecting certain character attributes over others
R: collapse by ID selecting certain character attributes over others

Time:03-23

I want to collapse microdata with multiple observations at different times per ID. Usually an ID has the same birth country but sometimes this changes. I want to collapse my data to one observation per ID and choose the country in a way that never chooses two specific countries (e.g. Canada and Germany). E.g. if there is one observation with birth country Canada and one with US, I want to choose the US. If a person has Italy and Germany I want to choose for example Italy. If there is only one country this should be kept.

My data:

ID  birth_country
1   US
1   Canada
1   Canada
1   Canada
2   Germany
2   Italy
2   Germany
3   Canada
3   Canada
3   Canada

Should look like:

ID  birth_country
1   US
2   Italy
3   Canada

I tried it with dplyer and group_by but couldn't figure out a proper way to formulate the selection by character.

CodePudding user response:

df %>%
   group_by(ID) %>%
   summarize(birth_country = first(birth_country))

should do the trick. Instead of first you can use last or nth (see ?summarise)

CodePudding user response:

With a preference vector that orders the country by preference:

df <- data.frame(ID = rep.int(1:3, c(4, 3, 3)), birth_country = c("US", "Canada", "Italy", "Germany")[c(1,2,2,2,4,3,4,2,2,2)])
preference <- c("US", "Italy", "Canada", "Germany")

library(dplyr)

df %>%
  group_by(ID) %>%
  summarize(birth_country = preference[min(match(birth_country, preference))])
#> # A tibble: 3 x 2
#>      ID birth_country
#>   <int> <chr>        
#> 1     1 US           
#> 2     2 Italy        
#> 3     3 Canada

Or data.table

library(data.table)

setDT(df)[, .(birth_country = preference[min(match(birth_country, preference))]), by = "ID"]
#>    ID birth_country
#> 1:  1            US
#> 2:  2         Italy
#> 3:  3        Canada
  •  Tags:  
  • r
  • Related