I have a dataset that looks like this:
patid age gender group pracid matched_id match_eventdate BMI
1 10 M case 100 1 23-05-20 NA
111 12 M control 100 1 23-05-20 20.8
222 9 M case 100 222 23-05-20 15.7
333 8 M control 100 222 23-05-20 21.8
555 8 M control 100 222 23-05-20 19.5
Each case can have up to 3 controls (some will have 1, some 2, some 3). Say, I need to cases that doesn't have BMI recorded(e.g. patid 1).I need to remove the corresponding controls with 1 (patid 111). It can be any number (not 111 as in the example above). How would I do that?
I know I need a for loop to go through the BMI, then save the ID cases that don't match that criteria, then remove those and corresponding controls.
CodePudding user response:
This is a two-step process, but it does not involve loops. I’m using the ‘dplyr’ package in the following. There are other solutions.
First, you identify which cases you want to remove. In this case, those where BMI
is NA:
excluded_patients = data |>
filter(group == 'case', is.na(BMI)) |>
pull(patid)
And the second step is to exclude those patients from the data:
filtered_data = data |>
filter(patid %in% excluded_patients)
Or maybe you need the following (it isn’t clear from your question):
filtered_data = data |>
filter(matched_id %in% excluded_patients)
CodePudding user response:
If I understand you correctly, you want to remove all cases and controls when a case has a missing BMI value (NA
). You can do this simply in base R by indexing on those conditions.
Code
df[!(df$matched_id %in% df$patid[is.na(df$BMI)]),]
# patid age gender group pracid matched_id match_eventdate BMI
# 6 222 9 M case 100 222 23-05-20 15.7
# 7 333 8 M control 100 222 23-05-20 21.8
# 8 555 8 M control 100 222 23-05-20 19.5
Data - note I am expanding your dataset a bit to include an extra control for patid == 1
and also an additional case with patient ID "5" to ensure validity.
df <-read.table(text = " patid age gender group pracid matched_id match_eventdate BMI
1 10 M case 100 1 23-05-20 NA
111 12 M control 100 1 23-05-20 20.8
111 12 M control 100 1 23-05-20 17.8
5 50 M case 500 5 23-05-20 NA
585 52 M control 500 5 23-05-20 20.8
222 9 M case 100 222 23-05-20 15.7
333 8 M control 100 222 23-05-20 21.8
555 8 M control 100 222 23-05-20 19.5", header = TRUE)
If I misunderstood and this is not the output you want, let me know and I can modify my answer. Good luck!