Beforehand
The most obvious answer to the title is that missing values are represented with NA in R. Dummy data:
x <- c("a", "NA", "<NA>", NA)
We can convert all elements of x to character strings using x_paste0 <- paste0(x). After doing so, the second and fourth elements are identical ("NA"), and to my knowledge this is why there is no way to back-transform x_paste0 to x.
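A quick check of that collision, as a minimal sketch using only base R: paste0 turns the real NA into the string "NA", whereas as.character leaves it missing.

```r
x <- c("a", "NA", "<NA>", NA)
x_paste0 <- paste0(x)

# After paste0 the real NA has become the string "NA",
# so it can no longer be told apart from the second element.
x_paste0[2] == x_paste0[4]
# [1] TRUE
anyNA(x_paste0)
# [1] FALSE

# as.character() on a character vector keeps the missing value.
is.na(as.character(x))
# [1] FALSE FALSE FALSE  TRUE
```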
addNA
But working with addNA indicates that it is not just NA itself that represents missing values. In x only the last element is missing. Let's transform the vector:
x_new <- addNA(x)
x_new
[1] a NA <NA> <NA>
Levels: <NA> a NA <NA>
Interestingly, the fourth element, i.e. the missing value, is shown as <NA> and not as NA. Further, the fourth element now looks the same as the third. And we are told that there are no missing values, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is the missing one (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually back-transform x_new. See:
as.character(x_new)
[1] "a" "NA" "<NA>" NA
How does as.character know that the third element is "<NA>" and the fourth is an actual missing value, i.e. NA?
CodePudding user response:
That's probably an inconsistency in the base:::print.factor() method: it displays the NA level as <NA>, which looks the same as the string "<NA>".
x <- c("a", "NA", "<NA>", NA)
addNA(x)
# [1] a NA <NA> <NA>
# Levels: <NA> a NA <NA>
But:
levels(addNA(x))
# [1] "<NA>" "a" "NA" NA
So, there are no duplicated levels.
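To see why nothing is lost, it may help to look at what the factor actually stores. A minimal sketch, assuming only base R: a factor is a vector of integer codes plus a levels attribute, and as.character() behaves like indexing the levels by those codes. The variable names f, codes and lv are only for illustration, and the order of the levels depends on the locale's sort order, so the codes themselves are not printed here.

```r
x <- c("a", "NA", "<NA>", NA)
f <- addNA(x)

codes <- as.integer(f)   # one integer code per element, none of them NA
lv <- levels(f)          # three strings plus one real NA level

# The third and fourth elements print alike but carry different codes.
codes[3] != codes[4]
# [1] TRUE

# Only the level behind the fourth element is a real missing value.
is.na(lv[codes])
# [1] FALSE FALSE FALSE  TRUE

# Indexing the levels by the codes reproduces as.character(f), and hence x.
identical(lv[codes], as.character(f))
# [1] TRUE
identical(as.character(f), x)
# [1] TRUE
```

This is also why any(is.na(f)) is FALSE: is.na looks at the codes, and every element has a valid code, even the one pointing at the NA level.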
CodePudding user response:
Usually you try to prevent this when you read your data, whether from a CSV file or another source. A somewhat contrived demo using read.table on your sample vector:
x <- c("a", "NA", "<NA>", NA)
x <- read.table(text = x, na.strings = c("NA", "<NA>", ""), stringsAsFactors = FALSE)$V1
x
[1] "a" NA NA NA
But if you want to fix it afterwards:
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")
unlist(lapply(x, function(v) { ifelse(v %in% na_strings, NA, v) }))
[1] "a" NA NA NA
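The same replacement can also be written without lapply, since %in% is vectorized. A sketch under the same assumptions, with na_strings being the same set of strings as in the snippet above:

```r
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")

# %in% never returns NA, so it is safe to use as a logical index.
x[x %in% na_strings] <- NA
x
# [1] "a" NA  NA  NA
is.na(x)
# [1] FALSE  TRUE  TRUE  TRUE
```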
Some notes on factors and addNA:
# not to be confused with character values that look like missing values but are not
x <- c("a", "b", "c", NA)
x_1 <- addNA(x)
x_1
# do not be misled by how the output is displayed
# [1] a b c <NA>
# Levels: a b c <NA>
str(x_1)
# Factor w/ 4 levels "a","b","c",NA: 1 2 3 4
is.na(x_1) # the underlying integer codes are 1, 2, 3, 4, so none of them is NA
# [1] FALSE FALSE FALSE FALSE
is.na(levels(x_1))
# [1] FALSE FALSE FALSE TRUE
# but nothing is lost
x_2 <- as.character(x_1)
str(x_2)
# chr [1:4] "a" "b" "c" NA
is.na(x_2)
# [1] FALSE FALSE FALSE TRUE
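If the goal is a factor in which the missing value counts as missing again, one option is to rebuild it from the character vector. This is only a sketch; it relies on factor() not turning NA into a level by default, and x_3 is just an illustrative name.

```r
x_1 <- addNA(c("a", "b", "c", NA))

# Rebuild from the character representation; NA is not made a level.
x_3 <- factor(as.character(x_1))
levels(x_3)
# [1] "a" "b" "c"
is.na(x_3)
# [1] FALSE FALSE FALSE  TRUE
```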