I have two dataframes. The first consists of data from 10 individuals, while the second consists in 1000. Variables are the same between dataframes:
set.seed(2022)
SmallCohort <- data.frame("ID" = paste0("S00", sample(1000:9999, 10)),
"Sex" = sample(c("M", "F")),
"Age" = sample(20:30, 10, replace = TRUE),
"Smoke" = sample(c("Y","N")),
"Disease" = sample(c("Y","N")))
BigCohort <- data.frame("Subj" = paste0("B00", sample(1000:9999, 1000)),
"Sex" = sample(c("M", "F")),
"Age" = sample(20:30, 1000, replace = TRUE),
"Smoke" = sample(c("Y","N")),
"Disease" = sample(c("Y","N")))
My goal is to find paired individuals from the BigCohort
. Basically, I need to choose 10 unique individuals from the BigCohort
paired for Sex
, Age
, Smoke
and Disease
with the same number of individuals from the small cohort.
I'm not looking for propensity score matching.
I tried this:
library(dplyr)
Combined <- inner_join(SmallCohort, BigCohort)
CodePudding user response:
I just did something similar yesterday! A simple loop and some tidyverse functions get the job done.
First, you are not using sample
correctly in your example data, so let's correct that first:
set.seed(2022)
SmallCohort <- data.frame("ID" = paste0("S00", sample(1000:9999, 10)),
"Sex" = sample(c("M", "F")),
"Age" = sample(20:30, 10, replace = TRUE),
"Smoke" = sample(c("Y","N"), 10, replace = T),
"Disease" = sample(c("Y","N"), 10, replace = T))
BigCohort <- data.frame("Subj" = paste0("B00", sample(1000:9999, 1000)),
"Sex" = sample(c("M", "F")),
"Age" = sample(20:30, 1000, replace = TRUE),
"Smoke" = sample(c("Y","N"), 1000, replace = T),
"Disease" = sample(c("Y","N"), 1000, replace = T))
For every person in SmallCohort
, the loop checks which people in BigCohort
are exact matches for your chosen attributes (Sex, Age, Smoke, Disease). The first eligible matching person from BigCohort
is chosen, if they haven't already been matched. This loop adds a column to SmallCohort
called "BigMatch", which stores the appropriate match from BigCohort (this tracks the matches and stores the end result). You can also use this to pull the list of BigCohort matches as well.
library(tidyverse)
SmallCohort$BigMatch <- NA
for (i in 1:nrow(SmallCohort)) {
small.attributes <- select(SmallCohort[i, ], -ID, -BigMatch)
big_matches <- BigCohort %>%
mutate(across(-Subj, ~.x == small.attributes[[cur_column()]])) %>%
filter(if_all(-Subj, ~. == T) & !(Subj %in% SmallCohort$BigMatch))
SmallCohort$BigMatch[i] <- big_matches$Subj[1]
}
BigCohortMatched <- filter(BigCohort, Subj %in% SmallCohort$BigMatch)
SmallCohort
ID Sex Age Smoke Disease BigMatch
1 S005323 M 21 Y Y B007379
2 S008885 F 26 N N B009542
3 S003870 M 25 Y Y B008244
4 S009899 F 24 N N B004840
5 S005869 M 20 N N B009496
6 S003750 F 22 N N B005087
7 S009998 M 20 Y Y B008961
8 S001475 F 21 Y N B006545
9 S001122 M 29 N Y B007727
10 S002271 F 26 Y N B008913
BigCohortMatched
Subj Sex Age Smoke Disease
1 B007727 M 29 N Y
2 B005087 F 22 N N
3 B009542 F 26 N N
4 B008961 M 20 Y Y
5 B008244 M 25 Y Y
6 B006545 F 21 Y N
7 B008913 F 26 Y N
8 B009496 M 20 N N
9 B004840 F 24 N N
10 B007379 M 21 Y Y