Given a reproducible data frame, I want to find the number of unique values in each column, not counting missing (NA) values. The code below counts NA as a value, so the cardinality of the nat_country column shows as 4 in n_unique_values (it should be 3). In Python, pandas has nunique(), which ignores NA values by default. How can I achieve this in R?
nat_country = c("United-States", "Germany", "United-States", "United-States", "United-States", "United-States", "Taiwan", NA)
age = c(14,15,45,78,96,58,25,36)
dat = data.frame(nat_country, age)
n_unique_values = t(data.frame(apply(dat, 2, function(x) length(unique(x)))))
CodePudding user response:
You can use dplyr::n_distinct with na.rm = TRUE:
library(dplyr)
sapply(dat, n_distinct, na.rm = TRUE)
# purrr alternative: map_dbl(dat, n_distinct, na.rm = TRUE)
#nat_country         age
#          3           8
In base R, you can use na.omit as well (the \(x) lambda shorthand requires R 4.1 or later):
sapply(dat, \(x) length(unique(na.omit(x))))
#nat_country         age
#          3           8
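If the result should keep the one-row, one-column-per-variable shape of n_unique_values from the question, the base R approach can be wrapped in a small helper. This is only a sketch: count_unique is a hypothetical name, not a function from base R or dplyr.

```r
# Data from the question
nat_country <- c("United-States", "Germany", "United-States", "United-States",
                 "United-States", "United-States", "Taiwan", NA)
age <- c(14, 15, 45, 78, 96, 58, 25, 36)
dat <- data.frame(nat_country, age)

# Hypothetical helper: drop NA first, then count the distinct values
count_unique <- function(x) length(unique(x[!is.na(x)]))

# lapply keeps the column names; as.data.frame turns the list into one row
n_unique_values <- as.data.frame(lapply(dat, count_unique))
n_unique_values
```

Unlike apply(dat, 2, ...), lapply does not convert the data frame to a character matrix first, so each column is counted in its own type.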
CodePudding user response:
We could use map or map_dfr with n_distinct. map_dfr returns a one-row tibble:
library(dplyr)
library(purrr)
dat %>%
  map_dfr(., n_distinct, na.rm = TRUE)
#   nat_country   age
#         <int> <int>
# 1           3     8
With the same packages loaded, map returns a list, which unlist turns into a named vector:
dat %>%
  map(., n_distinct, na.rm = TRUE) %>%
  unlist()
# nat_country         age
#           3           8
CodePudding user response:
In base R you can use table, which drops NA values by default. It also has a parameter useNA ("no", "ifany", or "always") if you want to change the default behavior.
sapply(dat, function(x) length(table(x)))
# nat_country         age
#           3           8
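To see what useNA changes, here is a small sketch on a toy vector (x and f are made up for illustration). It also shows one caveat that may matter if a column is a factor: table lists unused levels too, so its length can exceed the number of values actually present.

```r
x <- c("United-States", "Germany", "Taiwan", NA)

# Default (useNA = "no"): NA is dropped before tabulating
n_default <- length(table(x))

# useNA = "ifany": NA gets its own entry when at least one is present
n_with_na <- length(table(x, useNA = "ifany"))

# Caveat: for a factor, table() also reports levels with zero count
f <- factor(c("a", "a"), levels = c("a", "b"))
n_levels <- length(table(f))   # counts both levels, "a" and "b"
n_seen   <- length(unique(f))  # counts only the value that occurs

c(n_default = n_default, n_with_na = n_with_na,
  n_levels = n_levels, n_seen = n_seen)
```

For character and numeric columns, as in dat, the table approach gives the same counts as the other answers.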