Removing cases and corresponding controls-CodePudding

I have a dataset that looks like this:

     patid   age gender  group     pracid  matched_id match_eventdate  BMI
     1       10      M   case      100     1              23-05-20     NA
     111     12      M   control   100     1              23-05-20     20.8
     222     9       M   case      100     222            23-05-20     15.7
     333     8       M   control   100     222            23-05-20     21.8
     555     8       M   control   100     222            23-05-20     19.5

Each case can have up to 3 controls (some will have 1, some 2, some 3). Say, I need to cases that doesn't have BMI recorded(e.g. patid 1).I need to remove the corresponding controls with 1 (patid 111). It can be any number (not 111 as in the example above). How would I do that?

I know I need a for loop to go through the BMI, then save the ID cases that don't match that criteria, then remove those and corresponding controls.

CodePudding user response：

This is a two-step process, but it does not involve loops. I’m using the ‘dplyr’ package in the following. There are other solutions.

First, you identify which cases you want to remove. In this case, those where BMI is NA:

excluded_patients = data |>
    filter(group == 'case', is.na(BMI)) |>
    pull(patid)

And the second step is to exclude those patients from the data:

filtered_data = data |>
    filter(patid %in% excluded_patients)

Or maybe you need the following (it isn’t clear from your question):

filtered_data = data |>
    filter(matched_id %in% excluded_patients)

CodePudding user response：

If I understand you correctly, you want to remove all cases and controls when a case has a missing BMI value (NA). You can do this simply in base R by indexing on those conditions.

Code

df[!(df$matched_id %in% df$patid[is.na(df$BMI)]),]

#   patid age gender   group pracid matched_id match_eventdate  BMI
# 6   222   9      M    case    100        222        23-05-20 15.7
# 7   333   8      M control    100        222        23-05-20 21.8
# 8   555   8      M control    100        222        23-05-20 19.5

Data - note I am expanding your dataset a bit to include an extra control for patid == 1 and also an additional case with patient ID "5" to ensure validity.

df <-read.table(text = " patid   age gender  group     pracid  matched_id match_eventdate  BMI
     1       10      M   case      100     1              23-05-20     NA
     111     12      M   control   100     1              23-05-20     20.8
     111     12      M   control   100     1              23-05-20     17.8
     5       50      M   case      500     5              23-05-20     NA
     585     52      M   control   500     5              23-05-20     20.8
     222     9       M   case      100     222            23-05-20     15.7
     333     8       M   control   100     222            23-05-20     21.8
     555     8       M   control   100     222            23-05-20     19.5", header = TRUE)

If I misunderstood and this is not the output you want, let me know and I can modify my answer. Good luck!