Home > Net >  R: Replace NAs based across 140 columns to values from another 140 columns in same dataset
R: Replace NAs based across 140 columns to values from another 140 columns in same dataset

Time:06-08

I’m fairly new to R. I’ve been searching for multiple hours and asking colleagues who have more experience with R for help , but still haven’t found a solution. I did find a related thread here that provided some inspiration, but I couldn't figure out how to extend the approach to solve my issue.

Overview

I have a dataset with 1800 variables. The issue is that the dataset contains two sets of 140 variables with different data that were created from when the same survey went out at two different timepoints by accident based upon an automation error. The database created two versions of the same variable. I’m trying to index a unique column to check if the value for that row is “NA”, if it is then I want to fill it with values from the second set of 140 columns and overwrite the values in the original 140 columns. If the row isn’t “NA”, then I want it to leave it as is.

Dummy Data Set

Based on the below dummy set, what I’m attempting to do is check “week0date_v2” to see if it’s a NA value for a given row. If yes, then grab the values in that row from the columns from “week0date” to “var3” and write these values to same row in the columns from “week0date_v2” to “var3_v2”. If the value is not NA for a given row, then leave the original data.

library('tidyverse')
dummy_data<- data.frame(
  id = c(1,2,3,4,5,6), 
  week0date_v2 = c("2022/2/1", NA, "2022/2/15", "2022/2/18", "2022/2/15", NA),
  var1_v2 = c(75,NA,66,NA,65,NA), 
  var2_v2 = c(19,NA,67,10,54,NA), 
  var3_v2 = c(NA,NA,99,125,73,NA),
  verify_v2 = c(1,NA,1,1,1,NA),
  week0date = c(NA, "2022/3/2", "2022/4/15", "2022/4/18", NA, "2022/4/15"),
  var1 = c(NA,52,NA,63,NA,75),
  var2 = c(NA,78,NA,NA,NA,63),
  var3 = c(15,64,25,NA,20,13),
  verify = c(NA,1,NA,1,NA,1)
)
dummy_data <- dummy_data %>% mutate_if(is.character, as.Date)

My Current Thinking

I’ve tried multiple approaches and haven’t been successful.

My colleague and I first tried to get it to work for a single variable (just the “week0date_v2” column), which we were able to do using mutate() and case_when() approach below:

dummy_data_first_attempt <- dummy_data %>% 
  mutate(
    week0date_v2 = case_when(
      !is.na(week0date_v2) ~ week0date_v2, 
      is.na(week0date_v2) ~ week0date
    )) 

I don’t know how to extend the above approach to apply across the span of columns “week0date_v2” to “var3_v2." I tried writing a function to do this, but my skills with writing a function weren’t sufficient.

Any help with how to extend what the above to achieve the below desired outcome would be so greatly appreciated !!!

Desired Outcome

This is what I'm hoping to end up with (i.e., the values from “week0date” to “var3” are copied and pasted into “week0date_v2” to “var3_v2” when a given row in "week0date_v2" is NA ... else leave the original data)

dummy_data_wrangled <- data.frame(
  id = c(1,2,3,4,5,6), 
  week0date_v2 = c("2022/2/1", "2022/3/2", "2022/2/15", "2022/2/18", "2022/2/15", "2022/4/15"),
  var1_v2 = c(75,52,66,NA,65,75), 
  var2_v2 = c(19,78,67,10,54,63), 
  var3_v2 = c(NA,64,99,125,73,13),
  verify_v2 = c(1,NA,1,1,1,NA),
  week0date = c(NA, "2022/3/2", "2022/4/15", "2022/4/18", NA, "2022/4/15"),
  var1 = c(NA,52,NA,63,NA,75),
  var2 = c(NA,78,NA,NA,NA,63),
  var3 = c(15,64,25,NA,20,13),
  verify = c(NA,1,NA,1,NA,1)
)

CodePudding user response:

Add dplyr way below.

I tried both dplyr and base R.

In dplyr way,

dummy_data %>%
  mutate(across(ends_with("_v2") & !starts_with("verify"), ~ifelse(
    !is.na(week0date_v2),
    .x,
    get(str_remove(cur_column(), "_v2"))
  )))

This below is pretty messy for loop way.

for (j in 1:nrow(dummy_data)){
  if (is.na(dummy_data$week0date_v2[j])) {
    for (i in names(dummy_data)[endsWith(names(dummy_data), "_v2") & !startsWith(names(dummy_data), "verify")]) {
      dummy_data[j, i] <- dummy_data[j, str_remove(i, "_v2")]
    }
  }
}

  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      NA         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64        NA  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      NA      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13        NA 2022/4/15   75   63   13      1

CodePudding user response:

Here's a vectorised base R approach -

replace_inds <- is.na(dummy_data$week0date_v2)
cols <- c('week0date', grep('^var\\d $', names(dummy_data), value = TRUE))
cols_to_replace <- paste0(cols, '_v2')
dummy_data[replace_inds,cols_to_replace] <- dummy_data[replace_inds, cols]
dummy_data

#  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2  week0date var1 var2 var3 verify
#1  1   2022-02-01      75      19      NA         1       <NA>   NA   NA   15     NA
#2  2   2022-03-02      52      78      64        NA 2022-03-02   52   78   64      1
#3  3   2022-02-15      66      67      99         1 2022-04-15   NA   NA   25     NA
#4  4   2022-02-18      NA      10     125         1 2022-04-18   63   NA   NA      1
#5  5   2022-02-15      65      54      73         1       <NA>   NA   NA   20     NA
#6  6   2022-04-15      75      63      13        NA 2022-04-15   75   63   13      1

Note that this answer is very much based on the column names of the data that you have so make sure you adjust it according to your dataset.

Here is some explanation of the code -

replace_inds returns TRUE for NA values in week0date_v2. The TRUE values are the rows that we want to replace.

replace_inds
#[1] FALSE  TRUE FALSE FALSE FALSE  TRUE

cols are the name of the columns which we want to copy the values from in case of NA in week0date_v2.

cols
#[1] "week0date" "var1"      "var2"      "var3"     

cols_to_replace are the columns which are to be replaced by corresponding values in cols.

cols_to_replace
#[1] "week0date_v2" "var1_v2"      "var2_v2"      "var3_v2"  

CodePudding user response:

Here's an approach that I think might be more "vectorized" than Ronak Shah's. It uses two-column matrix indexing on both sides of congruent sections of the dataframe. Admittedly it does include the verify columns but I don't think that really detracts from this demonstration:

dummy_data[2:6][ which( is.na(dummy_data[2:6]), arr.ind=TRUE)] <-   dummy_data[7:11][which( is.na(dummy_data[2:6]), arr.ind = TRUE) ]
> dummy_data
  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      15         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64         1  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      63      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13         1 2022/4/15   75   63   13      1

I did a separate run without changing the "verify" column. I defined the selection matrix separately:

> tofind <- which( is.na(dummy_data[grepl("(week|var). v2", names(dummy_data) )]), arr.ind=TRUE)
> cols_to_repl <- grepl("(week|var). v2", names(dummy_data) )
> dummy_data[cols_to_repl][ tofind ] <-   dummy_data[7:11][ tofind  ]
> dummy_data
  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      15         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64        NA  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      63      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13        NA 2022/4/15   75   63   13      1
  •  Tags:  
  • r
  • Related