Pairing individuals from two dataframes-CodePudding

I have two dataframes. The first consists of data from 10 individuals, while the second consists in 1000. Variables are the same between dataframes:

set.seed(2022)
SmallCohort <- data.frame("ID" = paste0("S00", sample(1000:9999, 10)),
                          "Sex" = sample(c("M", "F")),
                          "Age" = sample(20:30, 10, replace = TRUE),
                          "Smoke" = sample(c("Y","N")),
                          "Disease" = sample(c("Y","N")))

BigCohort <- data.frame("Subj" = paste0("B00", sample(1000:9999, 1000)),
                        "Sex" = sample(c("M", "F")),
                        "Age" = sample(20:30, 1000, replace = TRUE),
                        "Smoke" = sample(c("Y","N")),
                        "Disease" = sample(c("Y","N")))

My goal is to find paired individuals from the BigCohort. Basically, I need to choose 10 unique individuals from the BigCohort paired for Sex, Age, Smoke and Disease with the same number of individuals from the small cohort.

I'm not looking for propensity score matching.

I tried this:

library(dplyr)
Combined <- inner_join(SmallCohort, BigCohort)

CodePudding user response：

I just did something similar yesterday! A simple loop and some tidyverse functions get the job done.

First, you are not using sample correctly in your example data, so let's correct that first:

set.seed(2022)
SmallCohort <- data.frame("ID" = paste0("S00", sample(1000:9999, 10)),
                          "Sex" = sample(c("M", "F")),
                          "Age" = sample(20:30, 10, replace = TRUE),
                          "Smoke" = sample(c("Y","N"), 10, replace = T),
                          "Disease" = sample(c("Y","N"), 10, replace = T))

BigCohort <- data.frame("Subj" = paste0("B00", sample(1000:9999, 1000)),
                        "Sex" = sample(c("M", "F")),
                        "Age" = sample(20:30, 1000, replace = TRUE),
                        "Smoke" = sample(c("Y","N"), 1000, replace = T),
                        "Disease" = sample(c("Y","N"), 1000, replace = T))

For every person in SmallCohort, the loop checks which people in BigCohort are exact matches for your chosen attributes (Sex, Age, Smoke, Disease). The first eligible matching person from BigCohort is chosen, if they haven't already been matched. This loop adds a column to SmallCohort called "BigMatch", which stores the appropriate match from BigCohort (this tracks the matches and stores the end result). You can also use this to pull the list of BigCohort matches as well.

library(tidyverse)

SmallCohort$BigMatch <- NA
for (i in 1:nrow(SmallCohort)) {
  
  small.attributes <- select(SmallCohort[i, ], -ID, -BigMatch)
  
  big_matches <- BigCohort %>% 
    mutate(across(-Subj, ~.x == small.attributes[[cur_column()]])) %>% 
    filter(if_all(-Subj, ~. == T) & !(Subj %in% SmallCohort$BigMatch)) 
  
  SmallCohort$BigMatch[i] <- big_matches$Subj[1]
}

BigCohortMatched <- filter(BigCohort, Subj %in% SmallCohort$BigMatch)

SmallCohort
        ID Sex Age Smoke Disease BigMatch
1  S005323   M  21     Y       Y  B007379
2  S008885   F  26     N       N  B009542
3  S003870   M  25     Y       Y  B008244
4  S009899   F  24     N       N  B004840
5  S005869   M  20     N       N  B009496
6  S003750   F  22     N       N  B005087
7  S009998   M  20     Y       Y  B008961
8  S001475   F  21     Y       N  B006545
9  S001122   M  29     N       Y  B007727
10 S002271   F  26     Y       N  B008913

BigCohortMatched
      Subj Sex Age Smoke Disease
1  B007727   M  29     N       Y
2  B005087   F  22     N       N
3  B009542   F  26     N       N
4  B008961   M  20     Y       Y
5  B008244   M  25     Y       Y
6  B006545   F  21     Y       N
7  B008913   F  26     Y       N
8  B009496   M  20     N       N
9  B004840   F  24     N       N
10 B007379   M  21     Y       Y