I'm trying to count the number of NA in multiple columns of my data. Here is a reproducible sample.
structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0,
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
I tried two methods: First one:
dt %>% select(starts_with("V2QE38")) %>% colSums(is.na(.))
And this gives me some results (in short, I have NAs in some columns) Then I tried another one:
colSums(is.na(dt[,c("V2QE38A", "V2QE38B", "V2QE38C", "V2QE38D")]))
And I found no NA in any of these columns.
I think the second result is correct. But I'm just wondering what did I do wrong to get the first result? Thank you!
CodePudding user response:
In the first case, there are multiple functions passed. We may either need to block it with {}
library(dplyr)
dt %>%
select(starts_with("V2QE38")) %>%
{colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
or have another %>%
dt %>%
select(starts_with("V2QE38")) %>%
is.na %>%
colSums
-output
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
The issue is that colSums
is executed first without evaluating the is.na
> dt %>%
select(starts_with("V2QE38")) %>%
colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D
6 1 12 0
which is the same as the OP's output with colSums(is.na(.))
CodePudding user response:
Base solution using sapply
and an annonymous function function(x){sum(is.na(x))}
:
data = structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0,
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
sapply(data, function(x){sum(is.na(x))})
# V2QE38A V2QE38B V2QE38C V2QE38D
# 0 0 0 0
Explanation:
sapply
applies a function on a list
. data.frame
is a list, with each vector being an item of this list. The s
in sapply
is for simplify, so sapply
will try to convert the output list (from lapply
) to a vector. If the required output is a list (it has some advantages), use lapply
instead.
is.na
returns a boolean TRUE/FALSE
vector. This can be converted to a numeric vector with 1/0
values.
sum
converts the TRUE/FALSE
vector into a 1/0
vector and sums the values.
Alternative solutions:
Alternatively, instead of treating the data.frame
as a list, treat it as a matrix. Then the highly optimized rowSums
and colSums
can come into play.
colSums(is.na(data))
# V2QE38A V2QE38B V2QE38C V2QE38D
# 0 0 0 0
rowSums(is.na(data))
# 1 2 3 4 5 6 7 8 9 10
# 0 0 0 0 0 0 0 0 0 0
This is great if you have a matrix
and want to find where the NA
s are.