Home > Software design >  Convert dates to NA depending on valid values corresponding to that specific date
Convert dates to NA depending on valid values corresponding to that specific date

Time:12-31

Let's say I have

> df
    fu1_date fu1_n_symp   fu5_date fu5_n_symp   fu7_date fu7_n_symp
1 2012-03-05          1 2014-03-05         NA 2016-03-05          1
2 2013-08-09          1 2015-10-09          2 2017-11-09         NA
3 2019-05-05          1 2020-06-07          2 2021-07-09          2

df denotes an extremely large dataframe. In this example, I have recorded the number of symptoms n_symp on different follow-up dates fu_date.

There are up to 20 follow-ups for each row in my dataframe, fu1_, fu2_, ... , fu20_. I need to correct my dataframe, so that if n_symp is NA then the corresponding fuX_date should be converted from as.Date() to NA.

You can see that row 1 had missing values in follow-up 5 (fu5_n_symp == NA), but not FU1 or FU7. Consequently, fu5_date in row 1 should be converted from 2014-03-05 to NA

I am looking for a solution in dplyr only.

Expected output

> df
    fu1_date fu1_n_symp   fu5_date fu5_n_symp   fu7_date fu7_n_symp
1 2012-03-05          1       <NA>         NA 2016-03-05          1
2 2013-08-09          1 2015-10-09          2       <NA>         NA
3 2019-05-05          1 2020-06-07          2 2021-07-09          2

Data

df <- structure(list(fu1_date = structure(c(15404, 15926, 18021), class = "Date"), 
    fu1_n_symp = c(1L, 1L, 1L), fu5_date = structure(c(16134, 
    16717, 18420), class = "Date"), fu5_n_symp = c(NA, 2L, 2L
    ), fu7_date = structure(c(16865, 17479, 18817), class = "Date"), 
    fu7_n_symp = c(1L, NA, 2L)), class = "data.frame", row.names = c(NA, -3L))

CodePudding user response:

Here is one way how we can do it, using pivoting:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(everything(),
               names_to = "key",
               values_to = "val", 
               values_transform = list(val = as.character)) %>% # change all to character class
  mutate(val = ifelse(is.na(lead(val, default = val[1])), NA_character_, val)) %>% 
  group_by(key) %>% 
  mutate(row = row_number()) %>% 
  pivot_wider(names_from= key, values_from = val) %>% 
  mutate(across(contains("date"), as.Date)) # to get back to date class
  select(-row)
  fu1_date   fu1_n_symp fu5_date   fu5_n_symp fu7_date   fu7_n_symp
  <chr>      <chr>      <chr>      <chr>      <chr>      <chr>     
1 2012-03-05 1          NA         NA         2016-03-05 1         
2 2013-08-09 1          2015-10-09 2          NA         NA        
3 2019-05-05 1          2020-06-07 2          2021-07-09 2 

CodePudding user response:

With pivot_longer(), you can specify ".value" to names_to to stack date and n_symp pairwise. In this case, one of names_sep or names_pattern must be supplied to specify how the column names should be split. Then you can easily replace those dates with NA where n_symp are missing. Finally, pivot the long data wider to get the original format.

library(dplyr)
library(tidyr)

df %>%
  mutate(id = 1:n()) %>% 
  pivot_longer(-id, names_to = c("fu", ".value"), names_sep = "(?<=\\d)_") %>%
  mutate(date = replace(date, is.na(n_symp), NA)) %>%
  pivot_wider(names_from = fu, values_from = c(date, n_symp),
              names_glue = "{fu}_{.value}", names_vary = "slowest")

# # A tibble: 3 × 7
#      id fu1_date   fu1_n_symp fu5_date   fu5_n_symp fu7_date   fu7_n_symp
#   <int> <date>          <int> <date>          <int> <date>          <int>
# 1     1 2012-03-05          1 NA                 NA 2016-03-05          1
# 2     2 2013-08-09          1 2015-10-09          2 NA                 NA
# 3     3 2019-05-05          1 2020-06-07          2 2021-07-09          2

names_vary in pivot_wider() controls in what order should the resulting column names be combined.

  • "fastest" (default)

    fu1_date   fu5_date   fu7_date   fu1_n_symp fu5_n_symp fu7_n_symp
    
  • "slowest"

    fu1_date   fu1_n_symp fu5_date   fu5_n_symp fu7_date   fu7_n_symp
    
  • Related