Count NA in multiple columns in R-CodePudding

I'm trying to count the number of NA in multiple columns of my data. Here is a reproducible sample.

structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0, 
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0, 
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
10L), class = "data.frame")

I tried two methods: First one:

dt %>% select(starts_with("V2QE38")) %>% colSums(is.na(.))

And this gives me some results (in short, I have NAs in some columns) Then I tried another one:

colSums(is.na(dt[,c("V2QE38A", "V2QE38B", "V2QE38C", "V2QE38D")]))

And I found no NA in any of these columns.

I think the second result is correct. But I'm just wondering what did I do wrong to get the first result? Thank you!

CodePudding user response：

In the first case, there are multiple functions passed. We may either need to block it with {}

library(dplyr)
dt %>% 
    select(starts_with("V2QE38")) %>%
    {colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D 
      0       0       0       0

or have another %>%

dt %>%
    select(starts_with("V2QE38")) %>%
    is.na %>%
    colSums

-output

V2QE38A V2QE38B V2QE38C V2QE38D 
      0       0       0       0

The issue is that colSums is executed first without evaluating the is.na

> dt %>% 
   select(starts_with("V2QE38")) %>% 
   colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D 
      6       1      12       0

which is the same as the OP's output with colSums(is.na(.))

CodePudding user response：

Base solution using sapply and an annonymous function function(x){sum(is.na(x))}:

data = structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0, 
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0, 
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
10L), class = "data.frame")

sapply(data, function(x){sum(is.na(x))})
# V2QE38A V2QE38B V2QE38C V2QE38D 
#       0       0       0       0

Explanation:

sapply applies a function on a list. data.frame is a list, with each vector being an item of this list. The s in sapply is for simplify, so sapply will try to convert the output list (from lapply) to a vector. If the required output is a list (it has some advantages), use lapply instead.

is.na returns a boolean TRUE/FALSE vector. This can be converted to a numeric vector with 1/0 values.

sum converts the TRUE/FALSE vector into a 1/0 vector and sums the values.

Alternative solutions:

Alternatively, instead of treating the data.frame as a list, treat it as a matrix. Then the highly optimized rowSums and colSums can come into play.

colSums(is.na(data))
# V2QE38A V2QE38B V2QE38C V2QE38D 
#       0       0       0       0 

rowSums(is.na(data))
# 1  2  3  4  5  6  7  8  9 10 
# 0  0  0  0  0  0  0  0  0  0

This is great if you have a matrix and want to find where the NAs are.