I am working with some survey data and would like to fill the contents of one survey item/column from another survey item, while keeping the original non-missing cell contents. For example: replace Q2_1.x with Q2_1.y if Q2_1.x is missing (missing is coded as "-99" or as NA_character_).
Here is an example of my data:
library(dplyr)
library(magrittr)
library(readr)
org_dat <- read_table('ID Q2_1.x Q2_2.x Q2_1.y Q2_2.y Q14_1.x Q14_1.y Q15
1 Yes NA NA NA Sometimes NA NA
2 -99 NA No NA NA Always Yes
3 Yes NA Yes NA NA NA NA
4 -99 NA NA No NA Yes No
5 NA -99 NA NA NA Always NA
6 -99 NA NA No NA NA NA') %>% mutate_all(as.character)
Here is my desired output:
dat_out <- read_table('ID Q2_1 Q2_2 Q14_1 Q15
1 Yes NA Sometimes NA
2 No NA Always Yes
3 Yes NA NA NA
4 -99 No Yes No
5 NA -99 Always NA
6 -99 No NA NA')
Current solution: I know that I can replace each of these columns individually, as shown below, but I have a lot of columns to deal with and I would like a smart dplyr/grepl way of solving this. Any ideas? It is always the case that I am replacing Q*.x with Q*.y.
org_dat %>%
  # parentheses make the intended grouping of & and | explicit
  mutate(Q2_1.x = case_when(!is.na(Q2_1.y) &
                              (Q2_1.x == '-99' | is.na(Q2_1.x)) ~ Q2_1.y,
                            TRUE ~ Q2_1.x)) %>%
  mutate(Q2_2.x = case_when(!is.na(Q2_2.y) &
                              (Q2_2.x == '-99' | is.na(Q2_2.x)) ~ Q2_2.y,
                            TRUE ~ Q2_2.x)) %>%
  mutate(Q14_1.x = case_when(!is.na(Q14_1.y) &
                               (Q14_1.x == '-99' | is.na(Q14_1.x)) ~ Q14_1.y,
                             TRUE ~ Q14_1.x)) %>%
  rename(Q2_1 = Q2_1.x,
         Q2_2 = Q2_2.x,
         Q14_1 = Q14_1.x) %>%
  select(-matches("x|y"))
CodePudding user response:
The key to the answer here is to first translate the user-defined NAs into real NAs with na_if, followed by coalesce with paired columns.
library(dplyr)
library(stringr)
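org_dat %>%
  # convert the user-defined missing code "-99" into real NAs, column by column
  mutate(across(everything(), ~na_if(.x, '-99'))) %>%
  # fill each .x column from its .y partner, looked up by name via cur_column()
  mutate(across(ends_with('.x'),
                ~coalesce(.x, get(str_replace(cur_column(), '\\.x$', '.y'))))) %>%
  select(-ends_with('.y')) %>%
  rename_with(~str_remove(.x, '\\..$'))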
# A tibble: 6 × 5
ID Q2_1 Q2_2 Q14_1 Q15
<chr> <chr> <chr> <chr> <chr>
1 1 Yes NA Sometimes NA
2 2 No NA Always Yes
3 3 Yes NA NA NA
4 4 NA No Yes No
5 5 NA NA Always NA
6 6 NA No NA NA
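The na_if/coalesce pairing used above can also be seen in isolation on two plain vectors. This is only a minimal sketch; the vectors x and y are made-up stand-ins for one ".x"/".y" column pair and are not taken from the survey data.

```r
library(dplyr)

# hypothetical stand-ins for one .x / .y column pair
x <- c("Yes", "-99", NA, "-99")
y <- c(NA, "No", "Maybe", NA)

# na_if() turns the user-defined missing code "-99" into a real NA
x_clean <- na_if(x, "-99")

# coalesce() keeps x_clean where it is non-missing and falls back to y otherwise
merged <- coalesce(x_clean, y)
merged
```

Because the "-99" codes are converted first, they never survive this route, which is why the result above shows NA where the desired output still has -99.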
EDIT
The original answer did not produce the actual desired output, because it replaced all user-defined NAs (-99) with real NAs.
If the OP wants to preserve these user-defined NAs, we can do as follows:
First, change all columns to character.
Second, split the data.frame into data frames paired by the prefix "Q{number}_{number}" with split.default.
Finally, modify all list elements with two columns ('x' and 'y' pairs) with modify_if and coalesce.
Note that coalesce only fills real NAs, so a -99 in the .x column is kept even when the .y column has a value (see row 2 in the output below).
library(dplyr)
library(purrr)
org_dat %>%
  mutate(across(everything(), as.character)) %>%
  split.default(sub('\\..$', '', names(org_dat))) %>%
  modify_if(.p = ~ncol(.x) == 2, .f = ~coalesce(.x[[1]], .x[[2]])) %>%
  bind_cols() %>%
  select(ID, Q2_1, Q2_2, Q14_1, Q15)
# A tibble: 6 × 5
ID Q2_1 Q2_2 Q14_1 Q15
<chr> <chr> <chr> <chr> <chr>
1 1 Yes NA Sometimes NA
2 2 -99 NA Always Yes
3 3 Yes NA NA NA
4 4 -99 No Yes No
5 5 NA -99 Always NA
6 6 -99 No NA NA
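To make the split.default step more concrete, here is a small sketch on a toy data frame. The data frame df and its column names a.x, a.y and b are hypothetical and only serve to show how columns are grouped by their name prefix.

```r
library(dplyr)

# hypothetical toy data: one x/y pair ("a") and one unpaired column ("b")
df <- data.frame(a.x = c(NA, "1"),
                 a.y = c("2", "3"),
                 b   = c("4", "5"),
                 stringsAsFactors = FALSE)

# split.default() treats the data frame as a list of columns and groups
# them by the prefix that remains after dropping the ".x" / ".y" suffix
parts <- split.default(df, sub("\\..$", "", names(df)))
names(parts)

# a paired element has two columns, which can then be coalesced
filled <- coalesce(parts$a[[1]], parts$a[[2]])
filled
```

Elements with a single column, such as parts$b here, are the ones modify_if leaves untouched.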
CodePudding user response:
A shorter approach. Just strip all ".x" suffixes off the variable names and then transmute the variables whose names are not "ID" and do not end with ".y". For each of those variables, get0 the counterpart with the ".y" suffix and do the replacement as follows. Note that if there is no counterpart with a ".y" suffix, get0 returns NULL and idx thus collapses to integer(0). As a result, the variable is returned as-is.
library(dplyr)
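org_dat %>%
  rename_with(~sub("\\.x$", "", .), ends_with(".x")) %>%
  transmute(ID, across(!ID & !ends_with(".y"), ~{
    # look up the ".y" partner by name; NULL if there is none
    .y <- get0(paste0(cur_column(), ".y"))
    # positions where .x is missing ("-99" or NA) but .y has a value
    idx <- which(.x %in% c("-99", NA) & !is.na(.y))
    replace(.x, idx, .y[idx])
  }))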
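The NULL / integer(0) behaviour described above can be checked outside of dplyr. The following is a rough sketch with a made-up vector x and a deliberately non-existent object name; it mirrors the body of the lambda rather than the full pipeline.

```r
# hypothetical column with no ".y" counterpart
x <- c("Yes", "-99", NA)

# get0() returns NULL (its default for `ifnotfound`) when the object does not exist
y <- get0("no_such_column_name.y")
is.null(y)

# !is.na(NULL) has length zero, so the combined condition has length zero too
idx <- which(x %in% c("-99", NA) & !is.na(y))
idx

# replacing at no positions leaves x unchanged
result <- replace(x, idx, y[idx])
result
```

This is why a column such as Q15 passes through the across call without modification.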
Here is a tidyverse approach. We would
- pivot_longer the dataset into four columns: ID, Q(uestion), x, and y.
- Replace any values in x at positions where x shows NA or "-99" but y shows a non-NA value.
- Remove column y and pivot_wider the dataset with names from Q and values from x.
The steps are not complicated, but they do require fine-grained control of the function arguments. This makes the code a bit long:
library(dplyr)
library(tidyr)
org_dat %>%
  pivot_longer(
    -ID,
    names_to = c("Q", ".value"),
    names_pattern = "([^\\.]*)\\.?([^\\.]*)",
    # columns without a suffix (e.g. Q15) are treated as "x"
    names_transform = list(.value = ~replace(., . == "", "x"))
  ) %>%
  mutate(x = {
    idx <- which(x %in% c("-99", NA) & !is.na(y))
    replace(x, idx, y[idx])
  }, y = NULL) %>%
  pivot_wider(names_from = Q, values_from = x)
The output
# A tibble: 6 x 5
ID Q2_1 Q2_2 Q14_1 Q15
<chr> <chr> <chr> <chr> <chr>
1 1 Yes NA Sometimes NA
2 2 No NA Always Yes
3 3 Yes NA NA NA
4 4 -99 No Yes No
5 5 NA -99 Always NA
6 6 -99 No NA NA