dplyr mutate converts double to logical, bug or correct?

See the code below:

library(dplyr)

df = tibble(
  x1 = c(0, 1, 2, 3),
  x2 = c(0, NA, 1, NA),
  x3 = as.double(NA)
)

df %>% 
  mutate(x1 = ifelse(x1 == 0, NA, x1)) %>% 
  mutate(x2 = ifelse(x2 == 0, NA, x2)) %>% 
  mutate(x3 = ifelse(x3 == 0, NA, x3)) %>% 
  str()

df %>% 
  rowwise() %>% 
  mutate(x1 = ifelse(x1 == 0, NA, x1)) %>% 
  mutate(x2 = ifelse(x2 == 0, NA, x2)) %>% 
  mutate(x3 = ifelse(x3 == 0, NA, x3)) %>% 
  str()

Column x3 is converted to logical, which recently caused an issue in some of my code.

Is this correct or is this a bug?

I don't understand the logic, since this works as expected for columns x1 and x2.

Answer:

In R, NA is a length-1 logical vector.

class(NA)
#> [1] "logical"

The equivalent missing value for numeric data is NA_real_. This is usually overlooked because when NA_real_ is printed, it is printed as NA:

NA_real_
#> [1] NA

class(NA_real_)
#> [1] "numeric"

When you create a numeric vector that contains NA values, each NA is coerced to NA_real_:

dput(c(1, NA)[2])
#> NA_real_
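The same coercion happens for the other atomic types, each of which has its own typed missing value. A small illustration, using only base R:

```r
# Each atomic type has its own missing-value constant
typeof(NA)
#> [1] "logical"
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"

# c() coerces a plain (logical) NA to the type of its neighbours
typeof(c(NA, 1L))
#> [1] "integer"
typeof(c(NA, "a"))
#> [1] "character"
```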

In your example, you have already explicitly converted x3 to double, so the column is filled with NA_real_:

class(as.double(NA))
#> [1] "numeric"

dput(as.double(NA))
#> NA_real_

This is all as expected. But inside ifelse, the first argument (test) is always treated as a logical vector. In your case, x3 is the equivalent of:

c(NA_real_, NA_real_, NA_real_, NA_real_)

But the expression c(NA_real_, NA_real_, NA_real_, NA_real_) == 0 returns a logical vector, since it is a logical test; you are asking "are these values equal to zero?".

class(c(NA_real_, NA_real_, NA_real_, NA_real_) == 0)
#> [1] "logical"

Inside ifelse, there are parameters for what to return where the test is TRUE and where it is FALSE, but none for what to return where the test is NA. The result starts out as a copy of the logical test vector, and only the TRUE and FALSE positions are overwritten, so every position where the test is NA keeps its logical NA.
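As a rough sketch of why this happens, base ifelse can be approximated by the function below. The name sketch_ifelse is hypothetical and used only for illustration; it is a simplification that leaves out the length-1 shortcut and the attribute handling of the real function.

```r
# Simplified approximation of base::ifelse (illustration only)
sketch_ifelse <- function(test, yes, no) {
  test <- as.logical(test)
  ans  <- test                     # result starts as the logical test vector
  len  <- length(ans)
  ypos <- which(test)              # positions where test is TRUE (NA is skipped)
  npos <- which(!test)             # positions where test is FALSE (NA is skipped)
  if (length(ypos) > 0L) ans[ypos] <- rep(yes, length.out = len)[ypos]
  if (length(npos) > 0L) ans[npos] <- rep(no,  length.out = len)[npos]
  ans                              # untouched NA positions are still logical
}

x2 <- c(0, NA, 1, NA)
x3 <- rep(NA_real_, 4)

# One FALSE position receives a double, so the whole result becomes double
typeof(sketch_ifelse(x2 == 0, NA, x2))
#> [1] "double"

# No TRUE or FALSE positions at all, so nothing is overwritten
typeof(sketch_ifelse(x3 == 0, NA, x3))
#> [1] "logical"
```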

In the case of column x2, there is one numeric value returned by ifelse, so the other 3 logical NA values in that column are converted to NA_real_.

x2 <- c(0, NA, 1, NA)
ifelse(x2 == 0, NA, x2)
#> [1] NA NA  1 NA

dput(ifelse(x2 == 0, NA, x2)[1])
#> NA_real_

However, in the final column, there are only logical NA values returned, and nothing in your code to convert them to NA_real_, so the column remains a logical NA column.
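To make the x3 case concrete, the short check below uses a vector equivalent to that column. Note that passing NA_real_ to base ifelse does not help here, because the test is NA everywhere and neither branch is ever used; wrapping the call in as.double is one possible workaround.

```r
x3 <- rep(NA_real_, 4)

typeof(x3)
#> [1] "double"

# The test is NA in every position, so the logical NA values are kept
typeof(ifelse(x3 == 0, NA, x3))
#> [1] "logical"

# Supplying a typed NA makes no difference: the yes branch is never used
typeof(ifelse(x3 == 0, NA_real_, x3))
#> [1] "logical"

# Coercing the result explicitly restores the original type
typeof(as.double(ifelse(x3 == 0, NA, x3)))
#> [1] "double"
```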

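There are a few possible solutions, but the idiomatic way to do this in dplyr is to use if_else instead of ifelse. It does the same job, but it checks that both branches have compatible types, so the type of the result never depends on the data. In dplyr versions before 1.1.0, if_else requires both branches to have exactly the same type, so you also need to write NA_real_ rather than NA (later versions cast the branches to a common type and accept a plain NA):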

df %>% 
  mutate(x1 = if_else(x1 == 0, NA_real_, x1)) %>% 
  mutate(x2 = if_else(x2 == 0, NA_real_, x2)) %>% 
  mutate(x3 = if_else(x3 == 0, NA_real_, x3)) %>% 
  str()
#> tibble [4 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ x1: num [1:4] NA 1 2 3
#>  $ x2: num [1:4] NA NA 1 NA
#>  $ x3: num [1:4] NA NA NA NA
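Since the goal here is only to turn zeros into missing values, another of the possible solutions is dplyr's na_if, which keeps the type of its input. The following is a sketch of that alternative rather than part of the original answer; it assumes dplyr 1.0.0 or later for across.

```r
library(dplyr)

df <- tibble(
  x1 = c(0, 1, 2, 3),
  x2 = c(0, NA, 1, NA),
  x3 = as.double(NA)
)

# Replace every 0 with NA in all three columns; na_if keeps each column's type
out <- df %>%
  mutate(across(x1:x3, ~ na_if(.x, 0)))

sapply(out, typeof)
#>       x1       x2       x3 
#> "double" "double" "double"
```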

Created on 2022-12-18 with reprex v2.0.2