I was working on a script today and noticed some very unexpected output. On inspection, I found that one variable in my dataset, which should always be numeric, contains one character value (essentially one cell with a typed "N/A" rather than a value properly read in as NA). This is not really a problem, as I can manually re-code this value as NA. What I am curious about is why I did not receive an error when comparing this vector against a number, and how to interpret the output. An example is provided below:
c("56.2", "84.7", "63", "9", "109.5", "16", "N/A", "50") >= 50
Results in the output:
TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
The logic behind which entries are marked as TRUE or FALSE is not immediately obvious to me. Could anyone provide an explanation?
CodePudding user response:
Additional note to the explanation by Merjin van Tilborg:
x <- c("56.2", "84.7", "63", "9", "109.5", "16", "N/A", "50")
x >= 50
# gives
[1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
# Now check which indexes fulfill this comparison (the why is explained by Merjin van Tilborg)
which(x >= 50)
[1] 1 2 3 4 7 8
# if you convert the vector to numeric before comparing:
as.numeric(x) >= 50
# you get:
[1] TRUE TRUE TRUE FALSE TRUE FALSE NA TRUE
Warning message:
NAs introduced by coercion
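To find which entries caused the NA, one option is to compare the converted vector with the original. This is only a minimal sketch: the names `num` and `bad` are made up for illustration, and it assumes that any value which fails to convert should be treated as missing.

```r
x <- c("56.2", "84.7", "63", "9", "109.5", "16", "N/A", "50")

# Convert once; the coercion warning is expected here, so silence it.
num <- suppressWarnings(as.numeric(x))

# Positions that became NA during conversion but were not NA in the input.
bad <- which(is.na(num) & !is.na(x))
bad
# [1] 7
x[bad]
# [1] "N/A"

# The numeric comparison; which() drops the NA position.
which(num >= 50)
# [1] 1 2 3 5 8
```

Checking `x[bad]` before re-coding shows exactly which strings were lost, which helps if the column holds other stray text besides "N/A".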
CodePudding user response:
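Because one side of the comparison is a character vector, R converts the number 50 to the string "50" and then compares strings, which is why there is no error. Strings are compared character by character in collation (roughly alphabetical) order, and digits come before letters. "109.5" starts with a 1, which comes before the 5 in "50", and is therefore "smaller" / earlier in the order. By the same logic "9" counts as larger than "50", and "N/A" is larger still because letters sort after digits.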
"ab" > "b"
# a comes before b
# [1] FALSE
"12" > "2"
# 1 comes before 2 as character
# [1] FALSE
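The same can be checked on the vector from the question. This is a small sketch, not a definitive account: it uses `sort()` with `method = "radix"` only to get a locale-independent (C-locale) ordering for display, whereas the `>=` comparison itself follows the collation of the current locale. For these values, which start with digits or an uppercase letter, the two orderings should agree.

```r
x <- c("56.2", "84.7", "63", "9", "109.5", "16", "N/A", "50")

# The number 50 is coerced to the string "50" before comparing,
# so both calls give the same result.
identical(x >= 50, x >= "50")
# [1] TRUE

# Sorting as strings shows the order the comparison relies on:
# everything from "50" onwards counts as >= "50".
sort(x, method = "radix")
# [1] "109.5" "16"    "50"    "56.2"  "63"    "84.7"  "9"     "N/A"
```

The six strings from "50" onwards are exactly the positions 1, 2, 3, 4, 7 and 8 that `which(x >= 50)` reports.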