Let's say I have
> df
fu1_date fu1_n_symp fu5_date fu5_n_symp fu7_date fu7_n_symp
1 2012-03-05 1 2014-03-05 NA 2016-03-05 1
2 2013-08-09 1 2015-10-09 2 2017-11-09 NA
3 2019-05-05 1 2020-06-07 2 2021-07-09 2
df
denotes an extremely large dataframe. In this example, I have recorded the number of symptoms n_symp
on different follow-up dates fu_date
.
There are up to 20 follow-ups for each row in my dataframe, fu1_
, fu2_
, ... , fu20_
. I need to correct my dataframe, so that if n_symp
is NA
then the corresponding fuX_date
should be converted from as.Date()
to NA
.
You can see that row 1
had missing values in follow-up 5 (fu5_n_symp == NA
), but not FU1 or FU7. Consequently, fu5_date
in row 1 should be converted from 2014-03-05
to NA
I am looking for a solution in dplyr
only.
Expected output
> df
fu1_date fu1_n_symp fu5_date fu5_n_symp fu7_date fu7_n_symp
1 2012-03-05 1 <NA> NA 2016-03-05 1
2 2013-08-09 1 2015-10-09 2 <NA> NA
3 2019-05-05 1 2020-06-07 2 2021-07-09 2
Data
df <- structure(list(fu1_date = structure(c(15404, 15926, 18021), class = "Date"),
fu1_n_symp = c(1L, 1L, 1L), fu5_date = structure(c(16134,
16717, 18420), class = "Date"), fu5_n_symp = c(NA, 2L, 2L
), fu7_date = structure(c(16865, 17479, 18817), class = "Date"),
fu7_n_symp = c(1L, NA, 2L)), class = "data.frame", row.names = c(NA, -3L))
CodePudding user response:
Here is one way how we can do it, using pivoting:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything(),
names_to = "key",
values_to = "val",
values_transform = list(val = as.character)) %>% # change all to character class
mutate(val = ifelse(is.na(lead(val, default = val[1])), NA_character_, val)) %>%
group_by(key) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from= key, values_from = val) %>%
mutate(across(contains("date"), as.Date)) # to get back to date class
select(-row)
fu1_date fu1_n_symp fu5_date fu5_n_symp fu7_date fu7_n_symp
<chr> <chr> <chr> <chr> <chr> <chr>
1 2012-03-05 1 NA NA 2016-03-05 1
2 2013-08-09 1 2015-10-09 2 NA NA
3 2019-05-05 1 2020-06-07 2 2021-07-09 2
CodePudding user response:
With pivot_longer()
, you can specify ".value"
to names_to
to stack date
and n_symp
pairwise. In this case, one of names_sep
or names_pattern
must be supplied to specify how the column names should be split. Then you can easily replace those dates with NA
where n_symp
are missing. Finally, pivot the long data wider to get the original format.
library(dplyr)
library(tidyr)
df %>%
mutate(id = 1:n()) %>%
pivot_longer(-id, names_to = c("fu", ".value"), names_sep = "(?<=\\d)_") %>%
mutate(date = replace(date, is.na(n_symp), NA)) %>%
pivot_wider(names_from = fu, values_from = c(date, n_symp),
names_glue = "{fu}_{.value}", names_vary = "slowest")
# # A tibble: 3 × 7
# id fu1_date fu1_n_symp fu5_date fu5_n_symp fu7_date fu7_n_symp
# <int> <date> <int> <date> <int> <date> <int>
# 1 1 2012-03-05 1 NA NA 2016-03-05 1
# 2 2 2013-08-09 1 2015-10-09 2 NA NA
# 3 3 2019-05-05 1 2020-06-07 2 2021-07-09 2
names_vary
in pivot_wider()
controls in what order should the resulting column names be combined.
"fastest" (default)
fu1_date fu5_date fu7_date fu1_n_symp fu5_n_symp fu7_n_symp
"slowest"
fu1_date fu1_n_symp fu5_date fu5_n_symp fu7_date fu7_n_symp