I would like to join two datasets covering information from individuals from the same households but I have no access to individual IDs. I only that observations are linked together when they are part of the same household, but the group IDs do not match between the dataframes.
I was thinking of matching these data with the variables shared between the two dataframes, knowing that the probability of having duplicate matches tends to zero as the set of these shared variables increases. In my case, however, this set of variables is not large enough.
Here is a working example with only two shared variables, age and gender:
Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA) )
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), employed= c(0,1,NA,1,1,1,0,NA))
The end result should be something like this:
Df3 <- data.frame(family_id_x = c(1,1,1,2,2,3,3,3), family_id_y = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA), employed= c(0,1,NA,1,1,1,0,NA))
Using a simple left join, however, I still generate duplicates: for example, observations with age = 42 and gender = M appear three times and lead to a duplicate match.
left_join(Df1, Df2, by = c("age","gender"))
I would then like to exploit information on other households members to have no duplicates in my join. For example, while there are three obs. with age = 42 and gender = M, only one of them shares the same household with 2 other members, one having age = 37 and gender = F, and the other one age = 10 and gender = M.
One solution I thought of was widening the dataset so that each row also contains information about the other members of the household, but this can become very cumbersome coding-wise. Ordering the observations and appending the columns might also fail if Df2 has more observations than Df1, and if some observations differ between the two datasets. I wonder if there is a more elegant solution, allowing for looking for a combination of matches, like in these pictures: Df1 Df2
In other words, is there a joining function that matches observations which simultaneously share the same combination of variables and are all part of the same group of linked observations? Many thanks in advance for your patience and help.
CodePudding user response:
Well, if you are very confident that all family members are present in both tables and they do share the same characteristics, you can do them join like this:
library(dplyr)
Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA) )
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), employed= c(0,1,NA,1,1,1,0,NA))
Df3 <- data.frame(family_id_x = c(1,1,1,2,2,3,3,3), family_id_y = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA), employed= c(0,1,NA,1,1,1,0,NA))
# aggregate by family
m1 <- Df1 |>
# keep unique combinations only. Not an issue in your example data
select(family_id, age, gender) |>
unique() |>
# important to have the same order in both tables
arrange(family_id, age, gender) |>
group_by(family_id) |>
summarise(age_gender = paste(age, gender, collapse = ", "))
m1
#> # A tibble: 3 × 2
#> family_id age_gender
#> <dbl> <chr>
#> 1 1 10 M, 37 F, 42 M
#> 2 2 35 F, 42 M
#> 3 3 10 F, 37 F, 42 M
m2 <- Df2 |>
# keep unique combinations only. Not an issue in your example data
select(family_id, age, gender) |>
unique() |>
# important to have the same order in both tables
arrange(family_id, age, gender) |>
group_by(family_id) |>
# create helper variable to identify same family members
summarise(age_gender = paste(age, gender, collapse = ", "))
m2
#> # A tibble: 3 × 2
#> family_id age_gender
#> <chr> <chr>
#> 1 A 35 F, 42 M
#> 2 B 10 F, 37 F, 42 M
#> 3 C 10 M, 37 F, 42 M
# create a mapping table of ids based on demographic characteristics (age gender)
matches <- full_join(m1, m2, by = "age_gender", suffix = c("_x", "_y")) |> select(family_id_x, family_id_y)
matches
#> # A tibble: 3 × 2
#> family_id_x family_id_y
#> <dbl> <chr>
#> 1 1 C
#> 2 2 A
#> 3 3 B
# join both initial tables using ther matched ids
df_result <- Df1 |>
left_join(matches, by = c("family_id" = "family_id_x")) |>
left_join(Df2, by = c("family_id_y" = "family_id", "age", "gender")) |>
select(family_id_x = family_id, family_id_y, age, gender, tenure, employed)
df_result
#> family_id_x family_id_y age gender tenure employed
#> 1 1 C 42 M 12 0
#> 2 1 C 37 F 10 1
#> 3 1 C 10 M NA NA
#> 4 2 A 42 M 15 1
#> 5 2 A 35 F 9 1
#> 6 3 B 42 M 9 1
#> 7 3 B 37 F 10 0
#> 8 3 B 10 F NA NA
# result as expected
all.equal(df_result, Df3)
#> [1] TRUE
second solution to create the mapping table
# solution by full join
df_full <- full_join(Df1 |>
select(family_id, age, gender) |>
unique(),
Df2 |>
select(family_id, age, gender) |>
unique(),
by = c("age", "gender"))
df_full |>
group_by(family_id.x, family_id.y) |>
summarise(n = n()) |>
filter(n == max(n))
#> `summarise()` has grouped output by 'family_id.x'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 × 3
#> # Groups: family_id.x [3]
#> family_id.x family_id.y n
#> <dbl> <chr> <int>
#> 1 1 C 3
#> 2 2 A 2
#> 3 3 B 3
CodePudding user response:
Unique labels are needed for each cases (grouping vars)
In this case, the joining variables (by = c("age', 'gender')) have duplicates. Therefore, we have to assign unique label to those duplicates in order to distinguish each cases that share the same key.
library(tidyverse)
df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3),
age = c(42,37,10,42,35,42,37,10),
gender= c("M","F","M","M","F","M","F","F"),
tenure = c(12,10,NA,15,9,9,10,NA)) %>%
group_by(age, gender) %>% # The age and gender columns have duplicates
mutate(rown = row_number()) # Therefore, we need a case-specific column
df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"),
age = c(42,37,10,42,35,42,37,10),
gender= c("M","F","M","M","F","M","F","F"),
employed= c(0,1,NA,1,1,1,0,NA)) %>%
group_by(age, gender) %>%
mutate(rown = row_number())
df3 <- data.frame(family_id_x = c(1,1,1,2,2,3,3,3),
family_id_y = c("C","C","C","A","A","B","B","B"),
age = c(42,37,10,42,35,42,37,10),
gender= c("M","F","M","M","F","M","F","F"),
tenure = c(12,10,NA,15,9,9,10,NA),
employed= c(0,1,NA,1,1,1,0,NA))
# Join with three columns that are able to distinguish each cases
df4 <- left_join(df1, df2, by = c('age', 'gender', "rown"))
> df4
# A tibble: 8 x 7
# Groups: age, gender [5]
family_id.x age gender tenure rown family_id.y employed
<dbl> <dbl> <chr> <dbl> <int> <chr> <dbl>
1 1 42 M 12 1 C 0
2 1 37 F 10 1 C 1
3 1 10 M NA 1 C NA
4 2 42 M 15 2 A 1
5 2 35 F 9 1 A 1
6 3 42 M 9 3 B 1
7 3 37 F 10 2 B 0
8 3 10 F NA 1 B NA