Home > front end >  finding/comparing dates when sometimes only NA's are present
finding/comparing dates when sometimes only NA's are present

Time:08-07

I'm both new to coding in R (always used SPSS but have to use R for a project) and this website, so bear with me. Hopefully I'm both able to explain the problem and what I've tried.

My data looks somewhat like this:

    df <- data.frame (
        ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
        measurement = c (1, 2, 3, 1, 2, 3, 4, 1, 2),
        date_event1 = c(NA, NA, "2021-02-15", NA, NA, NA, "2021-03-01", NA, NA),
        date_event2 = c(NA, NA, NA, NA, "2021-03-06", NA, NA, "2022-02-02", "2022-02-02")
      )

    df
ID measurement  date_event1 date_event2
1           1               <NA>        <NA>
1           2               <NA>        <NA>
1           3         2021-02-15        <NA>
2           1               <NA>        <NA>
2           2               <NA>  2021-03-06
2           3               <NA>        <NA>
2           4         2021-03-01        <NA>
3           1               <NA>  2022-02-02
3           2               <NA>  2022-02-02

I have patients (identified by ID) with a variable number of measurements (identified by measurement number and their date, so long-format data) and events (coded here as 'event1' and 'event2'). Events can be present for a particular patient and measurement (then coded with the date it occurred) or absent (then coded as NA).

Ultimately, my goal is to calculate intervals (in days) between two first events, if two are present. If no or only 1 event took place, the result should be NA. Desired output should look something like this:

ID measurement  date_event1 date_event2 interval
1           1               <NA>        <NA>       NA
1           2               <NA>        <NA>       NA
1           3         2021-02-15        <NA>       NA
2           1               <NA>        <NA>       **5**
2           2               <NA>  2021-03-06       **5**
2           3               <NA>        <NA>       **5**
2           4         2021-03-01        <NA>        5
3           1               <NA>  2022-02-02       **NA**
3           2               <NA>  2022-02-02       NA

Main issues here are:

  • finding the first event using functions as 'min' will error if NA's are present.
  • using the 'min(x, na.rm=TRUE)' as a solution doesn't work if IDs with only NA's are present

What I've tried:

   df <- df %>%
      group_by(ID) %>%
        arrange(ID, measurement) %>%
# creating 2 identifier variables if all rows for event1/event2 are NA
        mutate(allNA1 = ifelse(all(is.na(date_event1)), 1, 0)) %>%
        mutate(allNA2 = ifelse(all(is.na(date_event2)), 1, 0)) %>%
      ungroup()

# for simplicity, combining these two identifier variables into 1 
    df$test <- ifelse(df$allNA1 == 0 & df$allNA2 == 0, 1, NA)
 
# then using this combined identifier variable to only use mindate on IDs that have at least 1 event  
    df <- df %>%
      group_by(ID) %>%
      mutate(mindate = if_else(test == 1, min(date_event1, na.rm=TRUE), NA_real_)) %>%
      ungroup ()

I haven't gotten to the comparing-dates step as finding the first date still produces the 'no non-missing arguments to min; returning Inf' warnings, even though I'm only mutating if test==1. What am I missing here? Are there easier solutions to my main problem? Thank you in advance!

Edit: forgot to add that the FIRST event should be used. Changes highlighted in bold.

Edit 2: made an error in the example, changed dates and intervals, also removed a column for simplicity.

Edit 3: Harre suggested to suppress warnings. Using the following did not work:

options(warn = -1)
df <- df %>%
    group_by(ID) %>%
    mutate(interval = abs(min(date_event1, na.rm=TRUE) - min(date_event2, na.rm=TRUE))) %>%
    ungroup()
options(warn = 1)

Edit 6/8/22: Okay, so I managed to circumvent the problem:

df <- df %>%
     group_by(ID) %>%
     # first recoding all missing values to an extremely late date
        mutate(date_event1 = if_else(is.na(date_event1), as.Date("2099-09-09"), date_event1)) %>%
        # then finding the earliest date (which are the non-2099-dates, if one was present)
        mutate(min_date_event1 = min(date_event1)) %>%
        # then recoding the 2099-dates back to NA
        mutate(min_date_event1= if_else(mindate == '2099-09-09', as.Date(NA_real_), mindate)) %>%

Which feels really inefficient for something I could do with 1 function in SPSS (AGGREGATE > FIRST). I'll checkmark my question but if anyone has an easier solution, feel free to add suggestions!

CodePudding user response:

Given that your "goal is to calculate intervals (in days) between two events, if two are present. If no or only 1 event took place, the result should be NA", there is no need for anything else than converting to the date-type:

library(dplyr)

df |>
  mutate(interval = abs(as.Date(date_event2) - as.Date(date_event1)))

or

library(dplyr)

df |>
  mutate(across(starts_with("date"), as.Date),
         interval = abs(date_event2 - date_event1))

Output:

  ID measurement date_measurement date_event1 date_event2 interval
1  1           1       2020-01-01        <NA>        <NA>  NA days
2  1           2       2020-01-05        <NA>        <NA>  NA days
3  1           3       2020-01-10  2021-02-15        <NA>  NA days
4  2           1       2021-02-01        <NA>  2021-03-01  NA days
5  2           2       2021-02-15        <NA>  2021-03-05  NA days
6  2           3       2021-03-01        <NA>        <NA>  NA days
7  2           4       2021-04-01  2021-03-01  2021-03-06   5 days
8  3           1       2022-01-01        <NA>  2022-02-02  NA days
9  3           2       2022-03-01        <NA>        <NA>  NA days

Update:

I this case you'll just want to suppress the warning as it doesn't influence the result (shown below). Alternatively you could arrange by the date and pick the first using first() no matter if it's NA or not.

df |>
    group_by(ID) |>
    mutate(across(starts_with("date"), as.Date),
           across(starts_with("date_event"), ~ suppressWarnings(min(., na.rm = TRUE)), .names = "{.col}_min"),
           interval = na_if(abs(date_event2_min - date_event1_min), Inf)) |>
    ungroup()

Output:

# A tibble: 9 × 7
     ID measurement date_event1 date_event2 date_event1_min date_event2_min interval
  <dbl>       <dbl> <date>      <date>      <date>          <date>          <drtn>  
1     1           1 NA          NA          2021-02-15      NA              NA days 
2     1           2 NA          NA          2021-02-15      NA              NA days 
3     1           3 2021-02-15  NA          2021-02-15      NA              NA days 
4     2           1 NA          NA          2021-03-01      2021-03-06       5 days 
5     2           2 NA          2021-03-06  2021-03-01      2021-03-06       5 days 
6     2           3 NA          NA          2021-03-01      2021-03-06       5 days 
7     2           4 2021-03-01  NA          2021-03-01      2021-03-06       5 days 
8     3           1 NA          2022-02-02  NA              2022-02-02      NA days 
9     3           2 NA          2022-02-02  NA              2022-02-02      NA days 
  • Related