In the given dataset, case_control indicates whether a row is a case or a control, id is an identifier that is unique for cases but can be repeated for controls, and group indicates the cluster. I need to select one control per case within each group, but once a control has been selected for a case, it cannot be selected for a later case, based on the id variable. If there are no available controls, the case has to be dropped.
How can I do this quickly on a very large dataset with ~10 million rows (2 million cases and 8 million controls)?
The dataset looks like this (https://docs.google.com/spreadsheets/d/1MpjKv9Fm_Hagb11h_dqtDX4hV7G7sZrt/edit#gid=1801722229):
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25
The expected output must look like this:
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_2 case 12
cluster_2 control 22
cluster_3 case 13
cluster_3 control 25
CodePudding user response:
Here is a data.table approach.
The code can be shortened (a lot), but I chose to keep each step separate (and commented), so you can see what actions are taken and inspect the intermediate results.
library(data.table)
# initialise the vector of control ids already used
id.used <- numeric(0)
# split by group and loop (assumes the case is the first row of each group)
L <- lapply(split(DT, by = "group"), function(x) {
  # select the first row (the case)
  caserow <- x[1, ]
  # select all remaining rows (the controls); x[-1, ] is also safe for one-row groups
  controlrow <- x[-1, ]
  # keep only the controls whose id is not already in use
  controlrow.new <- controlrow[!id %in% id.used, ]
  # no control available: drop the case (rbindlist() skips NULL elements)
  if (nrow(controlrow.new) == 0) return(NULL)
  # sample a random row from the controls not already used
  controlrow.sample <- controlrow.new[sample(.N, 1)]
  # record the id as used (be careful with <<-: it assigns in the enclosing environment)
  id.used <<- c(id.used, controlrow.sample$id)
  # row-bind the sampled control to the case row
  rbind(caserow, controlrow.sample)
})
# row-bind the list back together and cast to wide
dcast(rbindlist(L), group ~ case_control, value.var = "id")
# the selected controls depend on the random draw; one possible result:
#        group case control
# 1: cluster_1   11      21
# 2: cluster_2   12      24
# 3: cluster_3   13      25
Sample data used:
DT <- fread("group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25")
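Growing id.used with c() and checking it with %in% inside the loop can get slow when there are millions of groups. The following is a minimal base R sketch of the same greedy idea that keeps the used ids in an environment acting as a hash set. It is an illustration under stated assumptions, not a tuned solution: pick_controls is a hypothetical helper name, each group is assumed to contain exactly one case, and it takes the first unused control instead of sampling at random. The extra cluster_4 in the demo data is made up to show a case being dropped.

```r
# Hypothetical helper (not part of the answer above): greedy selection with a hash set.
# Assumes one case per group; cases are processed in row order.
pick_controls <- function(df) {
  used <- new.env(hash = TRUE)                 # ids of controls already taken
  is_ctrl <- df$case_control == "control"
  cases <- df[df$case_control == "case", ]
  ctrl_ids <- split(df$id[is_ctrl], df$group[is_ctrl])
  picked <- rep(NA, nrow(cases))               # chosen control per case, NA if none
  for (i in seq_len(nrow(cases))) {
    for (cand in ctrl_ids[[as.character(cases$group[i])]]) {
      key <- as.character(cand)
      if (!exists(key, envir = used, inherits = FALSE)) {
        assign(key, TRUE, envir = used)        # mark this control as used
        picked[i] <- cand
        break
      }
    }
  }
  keep <- !is.na(picked)                       # drop cases without a free control
  data.frame(group = cases$group[keep], case = cases$id[keep], control = picked[keep])
}

# Demo data: the three clusters from the question plus a made-up cluster_4
# whose controls (21 and 22) are both taken by earlier cases.
df <- data.frame(
  group = rep(paste0("cluster_", 1:4), times = c(4, 4, 4, 3)),
  case_control = c(rep(c("case", "control", "control", "control"), 3),
                   "case", "control", "control"),
  id = c(11, 21, 22, 23, 12, 21, 22, 24, 13, 21, 22, 25, 14, 21, 22)
)
res <- pick_controls(df)
# res pairs cases 11, 12, 13 with controls 21, 22, 25; case 14 is dropped
```

Note that a greedy pass like this does not guarantee the largest possible number of matched cases; the result depends on the order in which the cases are processed.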
CodePudding user response:
Base R:
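# start from an empty data frame so that every cluster, including the first, is filtered the same way
Reduce(\(x, y) rbind(x, y[which(!y$id %in% x$id)[1:2], ]), split(df, ~group), df[0, ])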
group case_control id
1 cluster_1 case 11
2 cluster_1 control 21
5 cluster_2 case 12
7 cluster_2 control 22
9 cluster_3 case 13
12 cluster_3 control 25
Note that we only need the case and the first control that has not already been selected in each cluster, hence the 1:2 slice. This relies on the case being the first row of its cluster.
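One edge case with the 1:2 slice: when a cluster has no unused control left, the second index is NA and the result gains a row of NAs instead of the case being dropped. A minimal sketch of a variant that drops such cases is shown here. The step function pick is a hypothetical name, every cluster is assumed to contain one case, and the extra cluster_4 in the demo data is made up to trigger the edge case.

```r
# Demo data: the three clusters from the question plus a made-up cluster_4
# whose controls (21 and 22) are already used by earlier cases.
df <- data.frame(
  group = rep(paste0("cluster_", 1:4), times = c(4, 4, 4, 3)),
  case_control = c(rep(c("case", "control", "control", "control"), 3),
                   "case", "control", "control"),
  id = c(11, 21, 22, 23, 12, 21, 22, 24, 13, 21, 22, 25, 14, 21, 22)
)

# Hypothetical step function: append the case and its first unused control,
# or leave the accumulated result unchanged when no control is free.
pick <- function(x, y) {
  free <- which(y$case_control == "control" & !(y$id %in% x$id))
  if (length(free) == 0) return(x)             # no control left: drop this case
  rbind(x, y[c(which(y$case_control == "case")[1], free[1]), ])
}

# df[0, ] is the empty starting value, so the first cluster is filtered too
res <- Reduce(pick, split(df, df$group), df[0, ])
# res$id is 11, 21, 12, 22, 13, 25; cluster_4 does not appear
```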
Tidyverse:
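library(dplyr)
library(purrr)
df %>%
  group_split(group) %>%
  # start from an empty tibble so that the first cluster is filtered like the others
  reduce(~rbind(.x, slice(anti_join(.y, .x, by = c("case_control", "id")), 1:2)),
         .init = as_tibble(df[0, ]))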
# A tibble: 6 x 3
group case_control id
<chr> <chr> <int>
1 cluster_1 case 11
2 cluster_1 control 21
3 cluster_2 case 12
4 cluster_2 control 22
5 cluster_3 case 13
6 cluster_3 control 25