Why do conditions with %in% ignore missing values?

Time:07-15

I encountered an unexpected output when I used %in% in a condition whilst recoding a categorical variable.

When an element of a vector on the left is NA, the condition evaluates as FALSE, whilst I expected it to be NA.

The behaviour I expected is what the more verbose version gives, with two == conditions joined by |:

dt <- data.frame(colour = c("red", "orange", "blue", NA))

# Expected
dt$is_warm1 <- ifelse(dt$colour == "red" | dt$colour == "orange", TRUE, FALSE)

# Unexpected
dt$is_warm2 <- ifelse(dt$colour %in% c("red", "orange"), TRUE, FALSE)

dt
#>   colour is_warm1 is_warm2
#> 1    red     TRUE     TRUE
#> 2 orange     TRUE     TRUE
#> 3   blue    FALSE    FALSE
#> 4   <NA>       NA    FALSE

This is quite unhelpful when recoding categorical variables because it silently turns missing values into FALSE. Why does this happen, and are there any alternatives that don't involve listing all the == conditions? (Imagine that colour contains thirty possible levels.)

CodePudding user response:

a %in% b is just shorthand for match(a, b, nomatch = 0) > 0 (check the source code for %in% to satisfy yourself that this is the case).

Dropping the nomatch = 0 argument brings NA back, but it then returns NA for every non-match (here "blue" as well), not only for missing inputs:

match(dt$colour, c("red", "orange")) > 0
#> [1] TRUE TRUE   NA   NA

In either form the comparison already returns a logical vector, so the ifelse() wrapper is not needed.
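
If the goal is NA only where the input itself is missing, one option is to keep the %in% result and blank out the missing inputs afterwards. This is a minimal sketch, assuming a character vector like the one in the question; the names colour, warm and is_warm are only illustrative.

```r
colour <- c("red", "orange", "blue", NA)
warm   <- c("red", "orange")

# %in% and the match() form give the same result
identical(colour %in% warm, match(colour, warm, nomatch = 0L) > 0)
#> [1] TRUE

# Keep FALSE for genuine non-matches, restore NA only for missing inputs
is_warm <- colour %in% warm
is_warm[is.na(colour)] <- NA
is_warm
#> [1]  TRUE  TRUE FALSE    NA
```

This scales to any number of levels, because only the vector on the right of %in% grows.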

CodePudding user response:

%in% treats NA as a value like any other and checks whether it appears in the vector on the right. Consider these two scenarios:

NA %in% 1:3
# [1] FALSE
NA %in% c(1:3, NA)
# [1] TRUE

This lets you check whether NA is in the vector or not.

If you want to preserve NA values, you could write your own alternative:

`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}

And then you can use it like this:

dt$is_warm3 <- dt$colour %nain% c("red", "orange")
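
As a quick check, the sketch below repeats the operator definition so that it runs on its own, then compares its result with the == / | version from the question. It assumes the same four-row dt as in the question.

```r
# Same definition as above, repeated so the example is self-contained
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}

dt <- data.frame(colour = c("red", "orange", "blue", NA))
dt$is_warm1 <- dt$colour == "red" | dt$colour == "orange"
dt$is_warm3 <- dt$colour %nain% c("red", "orange")

dt$is_warm3
#> [1]  TRUE  TRUE FALSE    NA

# The custom operator agrees with the verbose == / | version
identical(dt$is_warm1, dt$is_warm3)
#> [1] TRUE
```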

CodePudding user response:

The help page for %in% (?"%in%", which shares its documentation with ?match) explains this. The relevant excerpt is below. As its last line says, %in% never returns NA, which is why you get FALSE rather than NA. It does check for missing values, as @MrFlick mentioned in his answer.

Exactly what matches what is to some extent a matter of definition. For all types, NA matches NA and no other value. For real and complex values, NaN values are regarded as matching any other NaN value, but not matching NA, where for complex x, real and imaginary parts must match both (unless containing at least one NA).

Character strings will be compared as byte sequences if any input is marked as "bytes", and otherwise are regarded as equal if they are in different encodings but would agree when translated to UTF-8 (see Encoding).

That %in% never returns NA makes it particularly useful in if conditions.
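
To illustrate that last point, here is a small sketch of how the two operators behave inside if when the value is missing. The variable names are only illustrative, and tryCatch is used just to capture the error so the script keeps running.

```r
x <- NA_character_

# == propagates NA, and if() cannot branch on NA, so this raises an error
res_eq <- tryCatch(
  if (x == "red") "warm" else "not warm",
  error = function(e) "error"
)
res_eq
#> [1] "error"

# %in% always yields TRUE or FALSE, so if() is safe
res_in <- if (x %in% "red") "warm" else "not warm"
res_in
#> [1] "not warm"
```

The same property that makes %in% safe in if is what hides missing values when recoding, so the choice between the two depends on whether NA should propagate.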
