Home > Software design >  "Temporarily" filtering dataset in R to create a new variable?
"Temporarily" filtering dataset in R to create a new variable?

Time:06-22

I am trying to capture the date of the laboratory assay that falls closest to:

  1. 30 days after a person was vaccinated (d3m1)
  2. 90 days after the date of their vaccination (d3m3), and
  3. 180 days after the date of their vaccination (d3m6).

E.g., if a person has 3 assays: one taken 27 days after vaccination, one 32 days after vaccination, and one 40 days after vaccination, I would like to use the one taken 32 days after vaccination as their "30-day assay," which I am calling d3m1.

Below is an example of my data frame:

     id      vax_date   assay_date   abs1  abs3   abs6   wpost1   wpost3   wpost6    
1     1     13oct2021    12oct2021     NA    NA     NA     NA       NA       NA
2     1     13oct2021    12nov2021      0    NA     NA      1       NA       NA
3     1     13oct2021    27oct2021     16    NA     NA      1       NA       NA
4     1     13oct2021    12jan2022     NA     1     NA     NA        1       NA
5     1     13oct2021    13apr2022     NA    NA      2     NA       NA        1

I have several variables in my dataset that are markers of whether an assay is eligible to serve as the 1-, 3-, or 6-month assay. These are called wpost1, wpost3, and wpost6.

I also have a variable that captures the absolute value of the number of days between each assay and the date each person received their vaccine. There is one such variable for the assays that correspond to 30 days after the date of the vaccination (abs1), one that corresponds to 90 days after vaccination (abs3), and one that corresponds to 180 days after vaccination (abs6). Abs1 was created by taking the absolute value of the number of days between vaccination and any assay that fell between 14 and 45 days after vaccination, minus 30 days: i.e.,

abs1 = case_when(between(assay_date - vax_date, 14,45) ~ abs(assay_date - vax_date) - 30) 

I am very new to data.table but wrote the following code after reading a few examples here on StackOverflow in order to capture the assay closest to 30 days after vaccination.

dt <- as.data.table(dt)
dt <- dt[order(id, abs1)]
dt <- dt[wpost1==1, d3m1 := assay_date[1], by = id]

In effect, I am simply ordering the dataset by abs1 after subsetting to rows where wpost1 = 1 and then selecting the assay date from the first row by id. I'm certain there are smarter ways to do this, but my code results in the following:

     id      vax_date   assay_date   abs1  abs3   abs6   wpost1   wpost3   wpost6   d3m1    
1     1     13oct2021    12oct2021     NA    NA     NA     NA       NA       NA       NA
2     1     13oct2021    12nov2021      0    NA     NA      1       NA       NA  12nov2021
3     1     13oct2021    27oct2021     16    NA     NA      1       NA       NA  12nov2021
4     1     13oct2021    12jan2022     NA     1     NA     NA        1       NA       NA
5     1     13oct2021    13apr2022     NA    NA      2     NA       NA        1       NA

There are two things that I would like to accomplish that I am struggling with: (1) the code I wrote places NAs in the cells of the rows for the new variable that did not meet the filter/subsetting criteria. In this example, the 'value' of d3m1 is NA for all rows where wpost1 != 1. However, I would like all cells to have the same value for d3m1. (2) I would like to turn this into a loop to avoid repeating the same code for d3m3 and d3m6 while also using that same loop to capture lab values, and dates of other tests which fall closest to 1-, 3-, and 6-months after vaccination.

I have a slight preference to do this in dplyr but am more than happy to learn data.table if it makes more sense here!

Any tips would be greatly appreciated!

CodePudding user response:

First, it will help to make your date columns into data formats so that we can calculate with them.

library(dplyr)
df2 <- df1 %>%
  mutate(across(ends_with("date"), lubridate::dmy))

Then I'd suggest making a function to find matches. It would need the data frame, the name of the flag_col that is 1, the desired days between vax and assay, and the name of the new column to hold the best matching date.

find_match <- function(df, flag_col, days = 30, output_col) {
  df %>%
    filter( {{flag_col}} == 1) %>%
    group_by(id, vax_date) %>%
    mutate(dif = assay_date - vax_date) %>%
    slice_min(abs(dif - days), with_ties = FALSE) %>%
    summarize( {{output_col}} := assay_date, .groups = "drop")
}

I use {{ }}, part of rlang's non-standard evaluation, to specify within the context of the formula which column needs to be 1 for the current search. The := operator allows us to specify the name of the new column we want to add.

By joining the data to the output of this function, we can add the three new columns.

df2 %>%
  left_join(find_match(df2, wpost1,  30, "d3m1")) %>%
  left_join(find_match(df2, wpost3,  60, "d3m3")) %>%
  left_join(find_match(df2, wpost6, 180, "d3m6")) 

Result

Joining, by = c("id", "vax_date")
Joining, by = c("id", "vax_date")
Joining, by = c("id", "vax_date")
  id   vax_date assay_date abs1 abs3 abs6 wpost1 wpost3 wpost6       d3m1       d3m3       d3m6
1  1 2021-10-13 2021-10-12   NA   NA   NA     NA     NA     NA 2021-11-12 2022-01-12 2022-04-13
2  1 2021-10-13 2021-11-12    0   NA   NA      1     NA     NA 2021-11-12 2022-01-12 2022-04-13
3  1 2021-10-13 2021-10-27   16   NA   NA      1     NA     NA 2021-11-12 2022-01-12 2022-04-13
4  1 2021-10-13 2022-01-12   NA    1   NA     NA      1     NA 2021-11-12 2022-01-12 2022-04-13
5  1 2021-10-13 2022-04-13   NA   NA    2     NA     NA      1 2021-11-12 2022-01-12 2022-04-13

Sample data

To produce this, I took the OP's table, pasted it into google sheets to convert into spreadsheet columns using Split Text into Columns and some manual tweaking, then pasted that back into R using the datapasta package.

df1 <- data.frame(
  stringsAsFactors = FALSE,
                id = c(1L, 1L, 1L, 1L, 1L),
                vax_date = c("13Oct2021",
                             "13Oct2021","13Oct2021","13Oct2021","13Oct2021"),
              assay_date = c("12Oct2021",
                             "12Nov2021","27Oct2021","12Jan2022","13Apr2022"),
              abs1 = c(NA, 0L, 16L, NA, NA),
              abs3 = c(NA, NA, NA, 1L, NA),
              abs6 = c(NA, NA, NA, NA, 2L),
            wpost1 = c(NA, 1L, 1L, NA, NA),
            wpost3 = c(NA, NA, NA, 1L, NA),
            wpost6 = c(NA, NA, NA, NA, 1L)
      )
  • Related