Replace whole value by NA if specific character is found

Time:11-10

I would like to replace values with NA in specific rows if a specific character is found within the current value. For example, if a value contains "<" (less than), such as "<7.5", I would like to replace the whole value with NA.

Examples:

Column A: 3, 4, 8, <5.6, 1, 3
Column B: 7, 4, <6, 1, <2.2, 8

should be converted to:

Column A: 3, 4, 8, NA, 1, 3
Column B: 7, 4, NA, 1, NA, 8

I found this example in the dplyr documentation (https://dplyr.tidyverse.org/reference/na_if.html) using mutate() and na_if(), but it requires matching the whole string, for example:

y <- c("abc", "def", "", "ghi")
na_if(y, "def")

So "def" would be replaced by NA. But if I use

y <- c("abc", "def", "", "ghi")
na_if(y, "ef")

nothing is replaced. There is also an example with

library(dplyr)
data <- starwars
data %>%
  select(name, eye_color) %>%
  mutate(name = na_if(name, "Luke Skywalker")) %>% 
  mutate(eye_color = na_if(eye_color, "unknown")) -> dataedited

This code works perfectly for me, but it also needs an exact match rather than just part of the string. I could edit each column manually this way, but maybe there is a way to do this across multiple columns. I would like to convert values to NA if name contains "sky" or eye_color contains "unkn".

Can anyone help me?

Thank you!

CodePudding user response:

na_if() only does an exact, element-wise comparison against y, so it cannot test against a set of values or match part of a string. Instead, we can pass a logical vector to replace() to set the matching values to NA. For multiple columns, use across():

library(dplyr)
data <- data %>%
   mutate(across(c(name, eye_color),
       ~ replace(.,  . %in% c("Luke Skywalker", "unknown"), NA)))

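For a partial match, use a regex in str_detect() or grepl(). Matching is case-sensitive by default, so "sky" would not match "Skywalker"; wrapping the pattern in regex(..., ignore_case = TRUE) handles that:

library(stringr)
data <- data %>%
    mutate(across(c(name, eye_color),
       ~ replace(., str_detect(., regex("sky|unkn", ignore_case = TRUE)), NA)))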
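Applied to the columns from the question, the same replace() pattern could look like the sketch below. The data frame df and its column names A and B are assumptions for illustration. The values are stored as character because of the "<" entries, and fixed("<") matches the character literally rather than as a regex.

```r
library(dplyr)
library(stringr)

# Example data from the question (assumed layout); character columns
# because some entries carry a "<" prefix
df <- tibble(
  A = c("3", "4", "8", "<5.6", "1", "3"),
  B = c("7", "4", "<6", "1", "<2.2", "8")
)

# Set every value containing "<" to NA, then convert the column to numeric
cleaned <- df %>%
  mutate(across(c(A, B),
                ~ as.numeric(replace(.x, str_detect(.x, fixed("<")), NA))))

cleaned
```

Because the "<" values are already NA before as.numeric() runs, the conversion should not emit coercion warnings.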

CodePudding user response:

I've also found that na_if() isn't flexible enough, so I often use my own version, na_predicate(). It takes two arguments: the vector to edit and a predicate function that returns TRUE or FALSE.

For your situation, you can combine it with dplyr's across() to edit multiple columns.

library(dplyr)
library(stringr)

na_predicate <- function(x, fn) {
  predicate <- rlang::as_function(fn)
  
  x[predicate(x)] <- NA
  
  x
}

# Example of a simple predicate function. By default, it's applied to the vector
# to change
is_even <- function(x) x %% 2 == 0

na_predicate(1:10, is_even)
#>  [1]  1 NA  3 NA  5 NA  7 NA  9 NA


# But you can use the formula notation to make it apply to something else
# instead
na_predicate(c("a", "b", "c", "d"), ~ is_even(1:4))
#> [1] "a" NA  "c" NA



# Applying it to starwars data. Here's the original:
original_data <- starwars %>%
  select(name, eye_color, skin_color) %>% 
  head() %>% 
  print()
#> # A tibble: 6 x 3
#>   name           eye_color skin_color 
#>   <chr>          <chr>     <chr>      
#> 1 Luke Skywalker blue      fair       
#> 2 C-3PO          yellow    gold       
#> 3 R2-D2          red       white, blue
#> 4 Darth Vader    yellow    white      
#> 5 Leia Organa    brown     light      
#> 6 Owen Lars      blue      light
   

# And here I'm using na_predicate() to turn any value in the name/eye_color
# columns that contains an "l" into NA:
original_data %>% 
  mutate(across(c(name, eye_color),
                na_predicate, ~ str_detect(., "l")))
#> # A tibble: 6 x 3
#>   name        eye_color skin_color 
#>   <chr>       <chr>     <chr>      
#> 1 <NA>        <NA>      fair       
#> 2 C-3PO       <NA>      gold       
#> 3 R2-D2       red       white, blue
#> 4 Darth Vader <NA>      white      
#> 5 Leia Organa brown     light      
#> 6 Owen Lars   <NA>      light

Created on 2021-11-09 by the reprex package (v2.0.1)
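One caveat: as far as I know, passing extra arguments through across()'s ... has been deprecated since dplyr 1.1.0. A minimal sketch of the same call with an explicit anonymous function, which should work on both older and newer versions, might look like this (na_predicate() is repeated so the snippet runs on its own):

```r
library(dplyr)
library(stringr)

# Same helper as above: set elements to NA where the predicate returns TRUE
na_predicate <- function(x, fn) {
  predicate <- rlang::as_function(fn)
  x[predicate(x)] <- NA
  x
}

# Wrap the call in an anonymous function instead of relying on across()'s `...`
result <- starwars %>%
  select(name, eye_color, skin_color) %>%
  head() %>%
  mutate(across(c(name, eye_color),
                function(col) na_predicate(col, function(v) str_detect(v, "l"))))

result
```

The result should match the tibble shown above: only values containing a lowercase "l" in name and eye_color become NA.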

CodePudding user response:

Just convert the column to numeric; any component that is not numeric will be converted to NA. This generates warnings, but they can be suppressed.

Alternatively, the second approach below checks for any character that is not a digit or a dot, uses NA for those values, and then converts to numeric, so no warnings are generated in the first place.

The third approach is the same, except that it assumes the values to be converted to NA all contain <.

The fourth approach replaces any component starting with < with just < and then uses na_if.

x <- c(7, 4, "<6", 1, "<2.2", 8)

# 1
suppressWarnings(as.numeric(x))  
## [1]  7  4 NA  1 NA  8

# 2
as.numeric(ifelse(grepl("[^0-9.]", x), NA, x))
## [1]  7  4 NA  1 NA  8

# 3
as.numeric(ifelse(grepl("<", x), NA, x))
## [1]  7  4 NA  1 NA  8

# 4
library(dplyr)
as.numeric(na_if(sub("<.*", "<", x), "<"))
## [1]  7  4 NA  1 NA  8

If several values, or anything matching a regex pattern, should be mapped to NA, use replace() like this:

y <- head(letters)

# 5
replace(y, y %in% c("a", "c"), NA)
## [1] NA  "b" NA  "d" "e" "f"

# 6
replace(y, grepl("a|c", y), NA)
## [1] NA  "b" NA  "d" "e" "f"
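To apply the same idea to several columns of a data frame without extra packages, a base R sketch with lapply() might look like the following. The data frame df, the column names A and B, and the helper lt_to_na() are hypothetical names used only for illustration.

```r
# Example data frame from the question (assumed layout); character columns
df <- data.frame(
  A = c("3", "4", "8", "<5.6", "1", "3"),
  B = c("7", "4", "<6", "1", "<2.2", "8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: set any value containing "<" to NA, then convert to numeric
lt_to_na <- function(col) {
  as.numeric(replace(col, grepl("<", col, fixed = TRUE), NA))
}

# Apply the helper to the selected columns only
cols <- c("A", "B")
df[cols] <- lapply(df[cols], lt_to_na)

df
```

Assigning into df[cols] keeps the data frame structure and leaves any other columns untouched.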