I encountered an unexpected output when I used %in% in a condition whilst recoding a categorical variable. When an element of a vector on the left is NA, the condition evaluates as FALSE, whilst I expected it to be NA.
The expected behaviour is that of the more verbose statement with two == conditions separated by an |:
dt <- data.frame(colour = c("red", "orange", "blue", NA))
# Expected
dt$is_warm1 <- ifelse(dt$colour == "red" | dt$colour == "orange", TRUE, FALSE)
# Unexpected
dt$is_warm2 <- ifelse(dt$colour %in% c("red", "orange"), TRUE, FALSE)
dt
#>   colour is_warm1 is_warm2
#> 1    red     TRUE     TRUE
#> 2 orange     TRUE     TRUE
#> 3   blue    FALSE    FALSE
#> 4   <NA>       NA    FALSE
This is quite unhelpful when recoding categorical variables because it silently fills missing values. Why does this happen, and are there any alternatives that don't involve listing all the == conditions? (Imagine that colour contains thirty possible levels.)
CodePudding user response:
a %in% b is just shorthand for match(a, b, nomatch = 0) > 0 (check the source code for %in% to satisfy yourself that this is the case).
Removing the nomatch = 0 argument lets the NA through. Note, though, that every non-match (such as "blue") then becomes NA as well, not only the missing input:
match(dt$colour, c("red", "orange")) > 0
#> [1] TRUE TRUE NA NA
The comparison already returns a logical vector, so of course it doesn't require the ifelse.
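To make the relationship concrete, here is a small sketch on the question's data. The names colour, warm, no_default and is_warm are only illustrative, and the last step (blanking out the missing inputs after %in%) is one possible way to get FALSE for non-matches while keeping NA, not something the source of %in% does for you.

```r
colour <- c("red", "orange", "blue", NA)
warm <- c("red", "orange")

# %in% agrees with match(..., nomatch = 0) > 0 on this input
same <- identical(colour %in% warm, match(colour, warm, nomatch = 0) > 0)

# Without nomatch, the non-match ("blue") and the missing input are both NA
no_default <- match(colour, warm) > 0

# Start from %in% (FALSE for non-matches), then restore NA for missing inputs
is_warm <- colour %in% warm
is_warm[is.na(colour)] <- NA
is_warm
#> [1]  TRUE  TRUE FALSE    NA
```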
CodePudding user response:
%in% treats NA as a value like any other and checks whether it is in the list. Consider these two scenarios:
NA %in% 1:3
# [1] FALSE
NA %in% c(1:3, NA)
# [1] TRUE
This allows you to check whether NA is in the vector or not.
If you want to preserve NA values, you could write your own alternative:
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}
And then you can use
dt$is_warm3 <- dt$colour %nain% c("red", "orange")
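As a quick check, the operator can be run on the question's data. This sketch repeats the %nain% definition so that it is self-contained; the column name is_warm3 follows the line above.

```r
# NA-preserving variant of %in%: missing inputs stay NA
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}

dt <- data.frame(colour = c("red", "orange", "blue", NA))
dt$is_warm3 <- dt$colour %nain% c("red", "orange")
dt$is_warm3
#> [1]  TRUE  TRUE FALSE    NA
```

The result matches is_warm1 from the question: FALSE for "blue" and NA for the missing colour.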
CodePudding user response:
Here is some info from the help documentation (?`%in%`). As the last line of the excerpt says, %in% never returns NA, which is why it returns FALSE and not NA. It is checking for missing values, as @MrFlick mentioned in his answer.
Exactly what matches what is to some extent a matter of definition. For all types, NA matches NA and no other value. For real and complex values, NaN values are regarded as matching any other NaN value, but not matching NA, where for complex x, real and imaginary parts must match both (unless containing at least one NA).
Character strings will be compared as byte sequences if any input is marked as "bytes", and otherwise are regarded as equal if they are in different encodings but would agree when translated to UTF-8 (see Encoding).
That %in% never returns NA makes it particularly useful in if conditions.
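The last sentence of the excerpt can be illustrated with a minimal sketch. The variable names x, eq_result and in_result are made up for this example, and tryCatch is used here only to capture the error that if() raises when its condition is NA.

```r
x <- NA_character_

# == propagates NA, and if() stops with an error on an NA condition
eq_result <- tryCatch(
  if (x == "red") "warm" else "not warm",
  error = function(e) "error"
)

# %in% returns FALSE for the missing value, so if() takes the else branch
in_result <- if (x %in% "red") "warm" else "not warm"

eq_result
#> [1] "error"
in_result
#> [1] "not warm"
```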