How are missings represented in R?

Beforehand

The most obvious answer to the title question is that missing values are represented by NA in R. Some dummy data:

x <- c("a", "NA", "<NA>", NA)

We can turn every element of x into an ordinary string with x_paste0 <- paste0(x). Afterwards, the second and fourth elements are the same ("NA"), and to my knowledge this is why there is no way to transform x_paste0 back into x.
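
A minimal sketch of that collapse, using only base R (the name x_paste0 follows the text above): once paste0() has turned the real NA into the string "NA", nothing distinguishes it from the second element any more.

```r
x <- c("a", "NA", "<NA>", NA)

# paste0() coerces the real NA to the ordinary string "NA"
x_paste0 <- paste0(x)

is.na(x)          # only the fourth element is missing
is.na(x_paste0)   # nothing is missing any more

# the second and fourth elements are now indistinguishable
identical(x_paste0[2], x_paste0[4])
```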

addNA

But working with addNA suggests that it is not just NA itself that represents missing values. In x, only the last element is missing. Let's transform the vector:

x_new <- addNA(x)
x_new
[1] a    NA   <NA> <NA>
Levels: <NA> a NA <NA>

Interestingly, the fourth element, i.e. the missing value, is printed as <NA> and not as NA. Moreover, the fourth element now looks the same as the third. We are also told that there are no missing values, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is missing (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually transform x_new back:

as.character(x_new)
[1] "a"    "NA"   "<NA>" NA

How does as.character know that the third element is "<NA>" and the fourth is an actual missing value, i.e. NA?

CodePudding user response：

That's probably just a quirk of the base:::print.factor() method, which renders both the string "<NA>" and a real NA level as <NA>.

x <- c("a", "NA", "<NA>", NA)

addNA(x)
# [1] a    NA   <NA> <NA>
# Levels: <NA> a NA <NA>

But:

levels(addNA(x))
# [1] "<NA>" "a"    "NA"   NA

So there are no duplicated levels: the first level is the string "<NA>" and the last one is a real NA.
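
To see why as.character() can still recover the original vector, it may help to look at the two pieces a factor is built from: integer codes and a vector of levels. The sketch below assumes that as.character() on a factor amounts to looking each code up in levels(); the names f, codes and lv are only illustrative. The order of the levels depends on the locale's sort order, so no concrete codes are shown.

```r
x <- c("a", "NA", "<NA>", NA)
f <- addNA(x)

codes <- as.integer(f)  # one integer code per element, none of them NA
lv <- levels(f)         # three strings plus one real NA as the last level

anyNA(codes)            # FALSE: this is why is.na(f) is FALSE everywhere
is.na(lv)               # TRUE only for the last level

# looking up each code in the levels gives back the original vector
identical(lv[codes], as.character(f))
identical(as.character(f), x)
```

The string "<NA>" and the missing value therefore point at different levels, even though printing renders both as <NA>.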

CodePudding user response：

Usually you try to prevent this when you read in your data, whether from a CSV file or another source. Here is a slightly silly demo using read.table on your sample vector:

x <- c("a", "NA", "<NA>", NA)
x <- read.table(text = x, na.strings = c("NA", "<NA>", ""), stringsAsFactors = FALSE)$V1
x
[1] "a" NA  NA  NA

But if you want to fix it afterwards:

x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")

replace(x, x %in% na_strings, NA)

[1] "a" NA  NA  NA

Some notes on factors and addNA

# a cleaner example, to avoid confusion with strings that merely look like missing values
x <- c("a", "b", "c", NA)

x_1 <- addNA(x)
x_1

# do not be misled by how the output is displayed
# [1] a    b    c    <NA>
# Levels: a b c <NA>
  
str(x_1)
# Factor w/ 4 levels "a","b","c",NA: 1 2 3 4

is.na(x_1) # FALSE everywhere, because the underlying integer codes are 1, 2, 3, 4
# [1] FALSE FALSE FALSE FALSE

is.na(levels(x_1))
# [1] FALSE FALSE FALSE  TRUE

# but nothing is lost
x_2 <- as.character(x_1)

str(x_2)
# chr [1:4] "a" "b" "c" NA

is.na(x_2)
# [1] FALSE FALSE FALSE  TRUE
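
For contrast, a small sketch of what happens without addNA(): with plain factor(), NA is not a level, so the code itself is missing and is.na() reports it. The names f_plain and f_na are only illustrative.

```r
x <- c("a", "b", "c", NA)

f_plain <- factor(x)  # NA is excluded from the levels
f_na <- addNA(x)      # NA becomes a level of its own

levels(f_plain)       # "a" "b" "c"
as.integer(f_plain)   # 1 2 3 NA
is.na(f_plain)        # FALSE FALSE FALSE TRUE

as.integer(f_na)      # 1 2 3 4
is.na(f_na)           # FALSE FALSE FALSE FALSE
```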