I have two almost identical dataframes containing the same people (df_A and df_B). I would now like to check for each person how many values in df_A and df_B match (e.g., Person 1 has 3 identical values in df_A and df_B, whereas Person 4 has 2 identical values).
I would like to create new variables that contain the information about the number of matching values.
df_A and df_B could look like this:
df_A <- read.table(text=
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 NA 1 1 NA NA 1
4 1 1 NA NA 1 NA
5 NA NA NA 1 1 1", header=TRUE)
df_B <- read.table(text=
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 1 NA 1 1 NA NA
4 1 1 1 NA NA NA
5 1 1 1 NA NA NA", header=TRUE)
Ideally, the end result would look like this:
df_C <- read.table(text=
"ID Matches
1 3
2 3
3 1
4 2
5 0", header=TRUE)
Do you have any ideas on how achieve this most efficiently using R? I'm relatively new to R and would like to learn how to solve such problems without lengthy code. Thanks for your hints!
CodePudding user response:
Here's an idea.
library(dplyr)
library(tidyr)
left_join(df_A, df_B, by = 'ID') %>%
pivot_longer(-ID, names_pattern = '(.*).[xy]') %>%
group_by(ID, name) %>%
summarise(matches = !any(is.na(value)) & n_distinct(value, na.rm = TRUE)) %>%
summarise(matches = sum(matches))
#> # A tibble: 5 × 2
#> ID matches
#> <int> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 1
#> 4 4 2
#> 5 5 0