Replace content in one column with another column in R

I am working with some survey data and would like to replace the contents of one survey item/column with those of another survey item, while otherwise keeping the original cell contents. For example, replace Q2_1.x with Q2_1.y if Q2_1.x is missing (missing is coded either as "-99" or as NA_character_).

Here is an example of my data:

library(dplyr)
library(magrittr)
library(readr)

org_dat <- read_table('ID   Q2_1.x  Q2_2.x  Q2_1.y  Q2_2.y  Q14_1.x Q14_1.y Q15
    1   Yes NA  NA  NA  Sometimes   NA  NA
    2   -99 NA  No  NA  NA  Always  Yes
    3   Yes NA  Yes NA  NA  NA  NA
    4   -99 NA  NA  No  NA  Yes No 
    5   NA  -99 NA  NA  NA  Always  NA
    6   -99 NA  NA  No  NA  NA  NA') %>% mutate_all(as.character)

Here is my desired output:

dat_out <- read_table('ID   Q2_1    Q2_2    Q14_1   Q15
1   Yes NA  Sometimes   NA
2   No  NA  Always  Yes
3   Yes NA  NA  NA
4   -99 No  Yes No
5   NA  -99 Always  NA
6   -99 No  NA  NA')

Current solution

I know that I can replace each of these columns individually, as shown below, but I have a lot of columns to deal with and would like a smart dplyr/grepl way of solving this. Any ideas? It is always the case that I am replacing Q*.x with Q*.y.

org_dat %>%
  # Parentheses make the intended grouping explicit: & binds tighter than |
  mutate(Q2_1.x = case_when(!is.na(Q2_1.y) &
                              (Q2_1.x == '-99' | is.na(Q2_1.x)) ~ Q2_1.y,
                            TRUE ~ Q2_1.x)) %>%
  mutate(Q2_2.x = case_when(!is.na(Q2_2.y) &
                              (Q2_2.x == '-99' | is.na(Q2_2.x)) ~ Q2_2.y,
                            TRUE ~ Q2_2.x)) %>%
  mutate(Q14_1.x = case_when(!is.na(Q14_1.y) &
                               (Q14_1.x == '-99' | is.na(Q14_1.x)) ~ Q14_1.y,
                             TRUE ~ Q14_1.x)) %>%
  rename(Q2_1 = Q2_1.x,
         Q2_2 = Q2_2.x,
         Q14_1 = Q14_1.x) %>%
  # Drop the leftover .y columns by suffix rather than by any "x" or "y" in the name
  select(-ends_with(".y"))

CodePudding user response：

The key here is to first translate the user-defined NAs ("-99") into real NAs with na_if, and then to coalesce each pair of columns.

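library(dplyr)
library(stringr)

org_dat %>%
    # na_if() works on vectors, so apply it column by column
    mutate(across(everything(), ~na_if(.x, '-99'))) %>%
    # fill each .x column from its .y partner, looked up by name
    mutate(across(ends_with('.x'),
                  ~coalesce(.x, get(str_replace(cur_column(), '\\.x$', '.y'))))) %>%
    select(-ends_with('.y')) %>%
    rename_with(~str_remove(.x, '\\..$'))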


# A tibble: 6 × 5
  ID    Q2_1  Q2_2  Q14_1     Q15  
  <chr> <chr> <chr> <chr>     <chr>
1 1     Yes   NA    Sometimes NA   
2 2     No    NA    Always    Yes  
3 3     Yes   NA    NA        NA   
4 4     NA    No    Yes       No   
5 5     NA    NA    Always    NA   
6 6     NA    No    NA        NA
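
To see what the two building blocks do on their own, they can be tried on plain vectors first. This is only a minimal sketch: `x` and `y` are made-up vectors standing in for one .x/.y pair, not the survey data.

```r
library(dplyr)

# Hypothetical vectors standing in for a .x/.y column pair
x <- c("Yes", "-99", NA, "-99")
y <- c(NA, "No", "Always", NA)

# Step 1: turn the user-defined missing code into a real NA
x_na <- na_if(x, "-99")
x_na
# "Yes" NA NA NA

# Step 2: take the first non-missing value, looking at x first and then y
filled <- coalesce(x_na, y)
filled
# "Yes" "No" "Always" NA
```

Note that the last element ends up as NA rather than "-99": once the code has been converted it cannot be told apart from a genuine NA.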

EDIT

The original answer did not produce the desired output, because it replaced every user-defined NA (-99) with a real NA.

If the OP wants to preserve these user-defined NAs, we can do as follows. First, convert all columns to character. Second, split the data frame into a list of data frames grouped by the prefix "Q{number}_{number}" with split.default. Finally, collapse every list element that has two columns (an 'x' and 'y' pair) with modify_if and coalesce. Because coalesce only fills real NAs, a "-99" in the .x column is kept even when the .y column has a value (see Q2_1 in row 2 of the output).

library(dplyr)
library(purrr)

org_dat %>%
    mutate(across(everything(), as.character)) %>%
        split.default(sub('\\..$', '', names(org_dat))) %>%
        modify_if(.p=~ncol(.x)==2, .f = ~coalesce(.x[[1]], .x[[2]])) %>%
        bind_cols() %>%
        select(ID, Q2_1, Q2_2, Q14_1, Q15)

# A tibble: 6 × 5
  ID    Q2_1  Q2_2  Q14_1     Q15  
  <chr> <chr> <chr> <chr>     <chr>
1 1     Yes   NA    Sometimes NA   
2 2     -99   NA    Always    Yes  
3 3     Yes   NA    NA        NA   
4 4     -99   No    Yes       No   
5 5     NA    -99   Always    NA   
6 6     -99   No    NA        NA
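
The column-wise split is the less familiar step, so here is a small base R sketch of it. The data frame `df` and its column names are hypothetical and only mimic the .x/.y naming of the question.

```r
# Hypothetical two-row data frame with one .x/.y pair and one unpaired column
df <- data.frame(ID   = c("1", "2"),
                 Q1.x = c("Yes", NA),
                 Q1.y = c(NA, "No"),
                 stringsAsFactors = FALSE)

# Strip the one-character suffix to get a grouping key per column
key <- sub("\\..$", "", names(df))
key
# "ID" "Q1" "Q1"

# split.default() treats the data frame as a list of columns,
# so it groups columns (not rows) by the key
parts <- split.default(df, key)
names(parts)
# "ID" "Q1"
ncol(parts$Q1)
# 2
```

The groups come back ordered by the sorted key rather than by original column position, which is presumably why the answer finishes with an explicit select() to restore the column order.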

CodePudding user response：

A shorter approach. Strip the ".x" suffix off the variable names and then transmute every variable other than "ID" whose name does not end with ".y". For each of those variables, use get0 to fetch the counterpart with the ".y" suffix and do the replacement as follows. Note that if there is no counterpart with a ".y" suffix, get0 returns NULL and idx thus collapses to integer(0). As a result, the variable is returned as-is.

library(dplyr)

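org_dat %>% 
  # drop the ".x" suffix so that e.g. Q2_1.x becomes Q2_1
  rename_with(~sub("\\.x$", "", .), ends_with(".x")) %>% 
  transmute(ID, across(!ID & !ends_with(".y"), ~{
    # NULL when the column has no ".y" counterpart (e.g. Q15)
    .y <- get0(paste0(cur_column(), ".y"))
    idx <- which(.x %in% c("-99", NA) & !is.na(.y))
    replace(.x, idx, .y[idx])
  }))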
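
The "no counterpart" case described above can be checked outside dplyr with base R alone. In this sketch `x` is a made-up vector and `no_such_column_anywhere.y` is a deliberately nonexistent name, assumed not to be defined anywhere on the search path.

```r
x <- c("Yes", "-99", NA)

# get0() returns NULL instead of raising an error when the name is not found
y <- get0("no_such_column_anywhere.y")
is.null(y)
# TRUE

# is.na(NULL) is a zero-length logical, so the whole condition has length zero
# and which() finds no positions
idx <- which(x %in% c("-99", NA) & !is.na(y))
idx
# integer(0)

# With an empty index, replace() hands x back unchanged
result <- replace(x, idx, y[idx])
identical(result, x)
# TRUE
```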

Here is another tidyverse approach, this time with tidyr. We would

1. pivot_longer the dataset into four columns: ID, Q(uestion), x, and y.
2. Replace any values in x at positions where x shows NA or "-99" but y shows a non-NA value.
3. Remove column y and pivot_wider the dataset with names from Q and values from x.

The steps are not complicated, but they do require fine-grained control over the function arguments. This makes the code a bit long:

library(dplyr)
library(tidyr)

org_dat %>% 
  pivot_longer(
    -ID, 
    names_to = c("Q", ".value"), 
    names_pattern = "([^\\.]*)\\.?([^\\.]*)", 
    names_transform = list(.value = ~replace(., . == "", "x"))
  ) %>% 
  mutate(x = {
    idx <- which(x %in% c("-99", NA) & !is.na(y))
    replace(x, idx, y[idx])
  }, y = NULL) %>% 
  pivot_wider(names_from = Q, values_from = x)

The output

# A tibble: 6 x 5
  ID    Q2_1  Q2_2  Q14_1     Q15  
  <chr> <chr> <chr> <chr>     <chr>
1 1     Yes   NA    Sometimes NA   
2 2     No    NA    Always    Yes  
3 3     Yes   NA    NA        NA   
4 4     -99   No    Yes       No   
5 5     NA    -99   Always    NA   
6 6     -99   No    NA        NA
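
The replacement rule shared by both snippets in this answer can also be checked in isolation. The sketch below uses base R only and takes the Q2_1.x and Q2_1.y values from the question's example data; the variable names `x`, `y` and `q2_1` are just for illustration.

```r
# Q2_1.x and Q2_1.y as they appear in the question's example data
x <- c("Yes", "-99", "Yes", "-99", NA, "-99")
y <- c(NA, "No", "Yes", NA, NA, NA)

# Positions where x is missing ("-99" or NA) and y has a value to offer
idx <- which(x %in% c("-99", NA) & !is.na(y))
idx
# 2

# Overwrite only those positions; everything else keeps its original value
q2_1 <- replace(x, idx, y[idx])
q2_1
# "Yes" "No" "Yes" "-99" NA "-99"
```

This matches the Q2_1 column of the desired output: "-99" survives wherever y has nothing better to offer.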