How to select columns with equal or more than 2 unique values while ignoring NA and blank?-CodePudding

My dataframe looks similar to this:

 df <- data.frame(ID = c(1, 2, 3, 4, 5),
               color = c(NA, "black", "black", NA, "brown"),
              animal = c("dog", "", "", "", "")
               owner = c("YES", "NO", "NO", "YES", NA))

ID	color	animal	owner
1	NA	dog	YES
2	black		NO
3	black		NO
4	NA		YES
5	brown		NA

I would like to retrieve the column names of all columns with more than 2 unique values while ignoring NA and blanks/empty strings in this calculation.

My solution so far:

df_col <- df %>% 
        select_if(function(col) length(unique(na.omit(col)))>1)

df_col <- colnames(df_col)

But I have noticed that na.omit() won't help, since it deletes the whole row.

Any help would be appreciated. Thank you in advance!

CodePudding user response：

Use n_distinct, which also have na.rm argument, The _if/_at/_all are deprecated in favor of across/where. The empty strings ('') can be checked with nzchar which returns a TRUE only if it is non-empty, thus subset the elements of the columns with nzchar and then apply n_distinct column wise and create the condition to select only those columns and then get the names

library(dplyr)
df %>%
    select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
     names

-output

[1] "ID"    "color" "owner"

An option is also to convert the "" to NA (na_if), perhaps it may be slightly compact

df %>% 
  select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>% 
  names

CodePudding user response：

You can do replace values with "" with NA (na_if), and then use lengths to count the number of unique values. Use names and which to get the vector of names that have more than two values.

names(which(lengths(lapply(na_if(df, ""), \(x) unique(x[!is.na(x)]))) >= 2))
[1] "ID"    "color" "owner"

Combining with n_distinct:

colnames(df)[lapply(na_if(df, ""), n_distinct, na.rm = T) >= 2]
[1] "ID"    "color" "owner"