I am trying to identify data frame columns that contain only a single character value, "tree".
Here is an example dataset.
df <- data.frame(id = c(1,2,3,4,5),
                 var.1 = c(5,6,7,"tree",4),
                 var.2 = c("tree","tree","tree","tree","tree"),
                 var.3 = c(4,5,8,9,1),
                 var.4 = c(NA,NA,NA,NA,NA),
                 var.5 = c("tree","tree",NA,"tree","tree"))
> df
id var.1 var.2 var.3 var.4 var.5
1 1 5 tree 4 NA tree
2 2 6 tree 5 NA tree
3 3 7 tree 8 NA <NA>
4 4 tree tree 9 NA tree
5 5 4 tree 1 NA tree
I would flag the var.2, var.4, and var.5 variables, since their values are all "tree", all NA, or a mix of NA and "tree". When a column contains a numeric value, I do not want to flag it.
> flagged
[1] "var.2" "var.4" "var.5"
Any ideas? Thanks!
CodePudding user response:
Here is an option with tidyverse: select the columns that have only NA values (all(is.na(.x))) or (|) that have only 'tree' as a value (all(.x == "tree", na.rm = TRUE)).
library(dplyr)
df %>%
  select(where(~ all(is.na(.x)) | all(.x == "tree", na.rm = TRUE))) %>%
  names
[1] "var.2" "var.4" "var.5"
Or using base R
names(df)[!colSums(df != "tree", na.rm = TRUE)]
[1] "var.2" "var.4" "var.5"
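The same column-wise test can also be spelled out with vapply, which may be easier to read or adapt. This is only a sketch: the helper name is_flagged is made up for illustration, and df is the example data frame from the question.

```r
# Example data from the question
df <- data.frame(id = c(1,2,3,4,5),
                 var.1 = c(5,6,7,"tree",4),
                 var.2 = c("tree","tree","tree","tree","tree"),
                 var.3 = c(4,5,8,9,1),
                 var.4 = c(NA,NA,NA,NA,NA),
                 var.5 = c("tree","tree",NA,"tree","tree"))

# Hypothetical helper: TRUE when every non-missing value equals "tree".
# An all-NA column also returns TRUE, because all() of an empty set is TRUE.
is_flagged <- function(x) all(x == "tree", na.rm = TRUE)

# Apply the helper to each column and keep the names that pass
flagged <- names(df)[vapply(df, is_flagged, logical(1))]
flagged
# [1] "var.2" "var.4" "var.5"
```

Because vapply checks that each result is a single logical, a helper that accidentally returns something else fails loudly instead of silently producing a wrong index.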
CodePudding user response:
Use |, which can be extended with virtually any number of conditions, and check which colSums equal nrow.
names(which(colSums(df == 'tree' | is.na(df)) == nrow(df)))
# [1] "var.2" "var.4" "var.5"
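To illustrate how the condition might be extended, the sketch below accepts a second value alongside "tree". The toy data frame df2 and the value "shrub" are hypothetical and do not come from the question.

```r
# Hypothetical toy data, not from the question
df2 <- data.frame(a = c("tree", "shrub", NA),
                  b = c("tree", "rock", "tree"),
                  c = c(1, 2, 3))

# Each cell is TRUE if it is "tree", "shrub", or NA; chain further
# conditions with | as needed
ok <- df2 == "tree" | df2 == "shrub" | is.na(df2)

# Keep columns where every row satisfies the combined condition
flagged2 <- names(which(colSums(ok) == nrow(df2)))
flagged2
# [1] "a"
```

The is.na(df2) term matters here: comparing NA with == yields NA, and NA | TRUE evaluates to TRUE, so missing cells count as matches rather than poisoning the column sum.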