I would like to remove all data.frame columns, where ALL of its values are NA. Additionally, I want to exclude those columns, where only for a certain number of rows (defined by a grouping variable) are NA.
Example:
library(tidyverse)
df <- tribble(~group, ~id, ~x, ~y, ~z,
"1", 1, 1, 1, NA,
"1", 2, NA, 1, NA,
"2", 1, 3, 1, NA,
"2", 2, NA, 1, NA,
"3", 1, 5, 1, NA,
"3", 2, NA, 1, NA,
"4", 1, 7, NA, 1,
"4", 2, NA, NA, 1,
"5", 1, 9, NA, NA,
"5", 2, NA, NA, NA)
In this example, I would like to keep x
and z
, but remove y
. That is because for the groups 4 and 5, its values are all NA. The other two columns also have a lot of missing values, but crucially have at least some values in groups 4 to 5. Ideally, I would like to do that with {dplyr}.
Code / Problem:
I'm kind of stuck with this code, which, for obvious reasons, doesn't work.
df %>%
mutate(group_new = ifelse(group=="4"|group=="5", 1, 0)) %>%
group_by(group_new) %>%
select_if(function(col) all(!is.na(col)))
There are several problems with this, which is that I would like to avoid creating a new grouping variable, and of course the fact that select_if
doesn't take the group_by()
condition into account but rather looks for complete columns (of which there are none).
Any help with this is much appreciated!
CodePudding user response:
library(dplyr)
df %>%
select(-where(~ all(is.na(.x[df$group %in% 4:5]))))
output
# A tibble: 10 × 4
group id x z
<chr> <dbl> <dbl> <dbl>
1 1 1 1 NA
2 1 2 NA NA
3 2 1 3 NA
4 2 2 NA NA
5 3 1 5 NA
6 3 2 NA NA
7 4 1 7 1
8 4 2 NA 1
9 5 1 9 NA
10 5 2 NA NA