My dataframe looks similar to this:
df <- data.frame(ID = c(1, 2, 3, 4, 5),
color = c(NA, "black", "black", NA, "brown"),
animal = c("dog", "", "", "", "")
owner = c("YES", "NO", "NO", "YES", NA))
ID | color | animal | owner |
---|---|---|---|
1 | NA | dog | YES |
2 | black | NO | |
3 | black | NO | |
4 | NA | YES | |
5 | brown | NA |
I would like to retrieve the column names of all columns with more than 2 unique values while ignoring NA and blanks/empty strings in this calculation.
My solution so far:
df_col <- df %>%
select_if(function(col) length(unique(na.omit(col)))>1)
df_col <- colnames(df_col)
But I have noticed that na.omit() won't help, since it deletes the whole row.
Any help would be appreciated. Thank you in advance!
CodePudding user response:
Use n_distinct
, which also have na.rm
argument, The _if/_at/_all
are deprecated in favor of across/where
. The empty strings (''
) can be checked with nzchar
which returns a TRUE only if it is non-empty, thus subset the elements of the columns with nzchar
and then apply n_distinct
column wise and create the condition to select
only those columns and then get the names
library(dplyr)
df %>%
select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
names
-output
[1] "ID" "color" "owner"
An option is also to convert the ""
to NA
(na_if
), perhaps it may be slightly compact
df %>%
select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>%
names
CodePudding user response:
You can do replace values with "" with NA (na_if
), and then use lengths
to count the number of unique values. Use names
and which
to get the vector of names that have more than two values.
names(which(lengths(lapply(na_if(df, ""), \(x) unique(x[!is.na(x)]))) >= 2))
[1] "ID" "color" "owner"
Combining with n_distinct
:
colnames(df)[lapply(na_if(df, ""), n_distinct, na.rm = T) >= 2]
[1] "ID" "color" "owner"