Flag columns based on their values in R

I am trying to identify the data frame columns that contain only a single character value, "tree".

Here is an example dataset.

df <- data.frame(id = c(1,2,3,4,5),
             var.1 = c(5,6,7,"tree",4),
             var.2 = c("tree","tree","tree","tree","tree"),
             var.3 = c(4,5,8,9,1),
             var.4 = c(NA,NA,NA,NA,NA),
             var.5 = c("tree","tree",NA,"tree","tree"))

> df
  id var.1 var.2 var.3 var.4 var.5
1  1     5  tree     4    NA  tree
2  2     6  tree     5    NA  tree
3  3     7  tree     8    NA  <NA>
4  4  tree  tree     9    NA  tree
5  5     4  tree     1    NA  tree

I would like to flag var.2, var.4 and var.5, since they contain only "tree", only NA, or a mix of "tree" and NA. When a column contains a numeric value, I do not want to flag it.

> flagged
[1] "var.2" "var.4" "var.5"

Any ideas? Thanks!

CodePudding user response：

Here is an option with tidyverse: select the columns that have only NA values (all(is.na(.x))) or (|) only 'tree' as a value once NAs are ignored (all(.x == "tree", na.rm = TRUE)), then extract their names.

library(dplyr)
df %>% 
   select(where(~ all(is.na(.x))| all(.x == "tree", na.rm = TRUE))) %>% 
   names
[1] "var.2" "var.4" "var.5"

Or using base R

 names(df)[!colSums(df != "tree", na.rm = TRUE)]
[1] "var.2" "var.4" "var.5"
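
To make it easier to see why the base R one-liner works, here is a minimal sketch that splits it into steps, using the df from the question. The intermediate names not_tree and n_not_tree are only illustrative and do not appear in the original answer.

```r
df <- data.frame(id = c(1,2,3,4,5),
                 var.1 = c(5,6,7,"tree",4),
                 var.2 = c("tree","tree","tree","tree","tree"),
                 var.3 = c(4,5,8,9,1),
                 var.4 = c(NA,NA,NA,NA,NA),
                 var.5 = c("tree","tree",NA,"tree","tree"))

# Logical matrix: TRUE where a cell is not "tree", NA where the cell is NA
not_tree <- df != "tree"

# Count the non-"tree" cells per column; na.rm = TRUE skips the NA cells
n_not_tree <- colSums(not_tree, na.rm = TRUE)
n_not_tree
#    id var.1 var.2 var.3 var.4 var.5
#     5     4     0     5     0     0

# A count of 0 means the column holds only "tree" and/or NA
flagged <- names(df)[n_not_tree == 0]
flagged
# [1] "var.2" "var.4" "var.5"
```

The one-liner's !colSums(...) relies on 0 being coerced to FALSE, so negating it gives TRUE for exactly the columns with a zero count; the explicit == 0 above is the same test written out.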

CodePudding user response：

Combine the conditions with |, which can be extended with as many further conditions as needed, and keep the columns whose colSums equal nrow(df).

names(which(colSums(df == 'tree' | is.na(df)) == nrow(df)))
# [1] "var.2" "var.4" "var.5"
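
As a sketch of how this could be extended, the example below adds one more condition with |. The data frame df2 and the second value "bush" are hypothetical and were made up only for illustration; they are not part of the question.

```r
# Hypothetical data: "bush" is an assumed second value to accept
df2 <- data.frame(a = c("tree", "bush", NA),
                  b = c("tree", "1", "bush"),
                  c = c("bush", "bush", "bush"))

# TRUE where a cell is "tree", "bush" or NA; is.na() turns the NA
# comparisons into TRUE, so the matrix contains no NA values
cond <- df2 == "tree" | df2 == "bush" | is.na(df2)

# Keep the columns in which every row satisfies the condition
flagged2 <- names(which(colSums(cond) == nrow(df2)))
flagged2
# [1] "a" "c"
```

With many accepted values, chaining | gets long; a per-column test with %in% would be an alternative, but that is a different technique from the one shown in this answer.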