Home > front end >  replace values with NA in several columns
replace values with NA in several columns

Time:12-28

I have a huge dataset and wish to replace values in certain columns (VAR1, VAR2) with NA if they do not start with AA or DD.

Data:

DF<-tibble::tribble(
  ~ID,  ~VAR1,    ~VAR1DATE,  ~VAR2,    ~VAR2DATE,
   1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
   2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
   3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
   4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
   5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
  )

Desired output:

A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA  

Is there an elegant and smart way to this? mutate_all?

CodePudding user response:

We may do this in two steps - loop across the columns that have 'VAR' followed by digits (\\d ) in column names, replace the values where the first two characters are not AA or DD to NA, then replace the corresponding DATE column to NA based on the NA in the 'VAR1', 'VAR2' columns

library(dplyr)
library(stringr)
DF %>%
    mutate(across(matches("^VAR\\d $"),
        ~ replace(., !substr(., 1, 2)  %in% c("AA", "DD"), NA)), 
      across(ends_with("DATE"), 
     ~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))

-output

# A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 <NA>  <NA>      
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 <NA>  <NA>       DDCC  2001-01-03
4     4 DDAA  2001-01-04 <NA>  <NA>      
5     5 <NA>  <NA>       <NA>  <NA>      

CodePudding user response:

Use mutate/across with the assignment function is.na<-.

DF %>%
  mutate(across(starts_with("VAR"), \(x) `is.na<-`(x, !grepl("^AA|^DD", x))))
## A tibble: 5 x 5
#     ID VAR1  VAR1DATE VAR2  VAR2DATE
#  <int> <chr> <chr>    <chr> <chr>   
#1     1 AABB  NA       NA    NA      
#2     2 AACC  NA       AACC  NA      
#3     3 NA    NA       DDCC  NA      
#4     4 DDAA  NA       NA    NA      
#5     5 NA    NA       NA    NA      

Or simpler:

DF %>%
  mutate(across(starts_with("VAR"), ~`is.na<-`(., !grepl("^AA|^DD", .))))

CodePudding user response:

Here is a tidyverse solution. Using across with str_replace_all and appending two ifelse statements.

library(dplyr)
library(stringr)

DF %>% 
  mutate(across(c(VAR1, VAR2), ~str_replace_all(., "^[^AA|DD]", NA_character_))) %>% 
  mutate(VAR1DATE = ifelse(is.na(VAR1), NA_character_, VAR1DATE),
         VAR2DATE = ifelse(is.na(VAR2), NA_character_, VAR2DATE))
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA 

CodePudding user response:

Another possibility, using tidyr::unite and tiydr::separate:

library(tidyverse)

DF<-tibble::tribble(
  ~ID,  ~VAR1,    ~VAR1DATE,  ~VAR2,    ~VAR2DATE,
  1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
  2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
  3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
  4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
  5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)

DF %>% 
  unite(VAR1,VAR1,VAR1DATE) %>% unite(VAR2,VAR2,VAR2DATE) %>% 
  mutate(across(starts_with("VAR"),~if_else(str_detect(.x, "^AA|^DD"), .x, NA_character_))) %>% 
  separate(VAR1,into = c("VAR1", "VAR1DATE"), sep = "_") %>% 
  separate(VAR2,into = c("VAR2", "VAR2DATE"), sep = "_")

#> # A tibble: 5 × 5
#>      ID VAR1  VAR1DATE   VAR2  VAR2DATE  
#>   <int> <chr> <chr>      <chr> <chr>     
#> 1     1 AABB  2001-01-01 <NA>  <NA>      
#> 2     2 AACC  2001-01-02 AACC  2001-01-02
#> 3     3 <NA>  <NA>       DDCC  2001-01-03
#> 4     4 DDAA  2001-01-04 <NA>  <NA>      
#> 5     5 <NA>  <NA>       <NA>  <NA>

CodePudding user response:

Here is another tidyverse solution with using str_detect to determine where to convert to NA for the date columns. Then, we can use the same type of function on VAR1 and VAR2.

library(tidyverse)

DF %>%
  rowwise %>%
  mutate(
    VAR1DATE = ifelse(str_detect(VAR1, '^BB|^CC') == TRUE, NA, VAR1DATE),
    VAR2DATE = ifelse(str_detect(VAR2, '^BB|^CC') == TRUE, NA, VAR2DATE)
  ) %>%
  mutate(across(c(VAR1, VAR2), function(x)
    ifelse(str_detect(x, '^BB|^CC') == TRUE, NA, x)))

Output

# A tibble: 5 × 5
# Rowwise: 
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA      
  •  Tags:  
  • r
  • Related