Selecting multiple columns in n_distinct() and length(unique()) in R

I'm writing a function in R that takes a character vector of column names and computes an adjusted count with n_distinct(), like so:

library(tidyverse)
packageVersion("tidyverse")
#[1] ‘1.3.2’

vars <- c("Sepal.Length", "Petal.Width")

foo <- iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = n_distinct(across(all_of(vars)))
    )

For the Species 'setosa', this results in a raw count of 50 and an adjusted count of 28.

However, I have a large dataset and this function is used in an R Shiny app, so I'm trying to optimise where possible.

I have read that length(unique()) is faster than n_distinct(), and I've seen some speedups in other functions. For this use, however, I have encountered two problems.

bar <- iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = length(unique(across(all_of(vars))))
  )

Replacing n_distinct() with length(unique()) here returns 2 for every group, which looks like the number of strings in the vars vector and is obviously not the desired result.

So I tested this using the actual variable names.

baz <- iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = length(unique(Sepal.Length, Petal.Width))
  )

For the Species 'setosa', this results in a raw count of 50 and an adjusted count of 15, and I am unsure why this produces a different result from n_distinct().

If anyone can explain the difference in results, and how to pass a character vector of column names to length(unique()) it would be greatly appreciated.

CodePudding user response：

If you're looking for a fast solution, you can try data.table::uniqueN. From its documentation:

"uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient."

Applied to your example:

iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = data.table::uniqueN(across(all_of(vars)))
  )
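To make the equivalence described for uniqueN concrete, here is a small check on made-up data. The objects x and df are purely illustrative and not part of the question, and the sketch assumes the data.table package is installed:

```r
library(data.table)

# Hypothetical toy inputs, purely for illustration
x  <- c(1, 2, 2, 3)
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

# Atomic vector: uniqueN() counts distinct values, like length(unique(x))
vec_match <- uniqueN(x) == length(unique(x))

# Data frame: uniqueN() counts distinct rows, like nrow(unique(df))
df_match <- uniqueN(df) == nrow(unique(df))
```

Both comparisons should come out TRUE; the data frame case is the one that matters for across(all_of(vars)), which hands a data frame to the counting function.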

The documentation also tells you the base R equivalent of n_distinct in a data.frame is nrow(unique(x)), not length(unique(x)):

iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = nrow(unique(across(all_of(vars))))
  )

This is because length() applied to a data frame counts its columns, not its rows, which is why length(unique(across(all_of(vars)))) returned 2:

length(iris)
#[1] 5
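The three numbers from the question (2, 28 and 15) can be reproduced in base R on the setosa rows alone. The explanation for the 15 is my reading rather than something taken from the quoted documentation: the second positional argument of unique() is incomparables, so unique(Sepal.Length, Petal.Width) does not combine the two columns. It deduplicates Sepal.Length only and treats the Petal.Width values as incomparable. No setosa sepal length equals a petal width, so that argument has no visible effect here. The variable names below are illustrative:

```r
# Base R only; `setosa` and the n_* names are illustrative
setosa <- iris[iris$Species == "setosa", ]
vars <- c("Sepal.Length", "Petal.Width")

# length() of a data frame is its number of columns
n_cols <- length(unique(setosa[vars]))   # 2

# nrow() counts the distinct (Sepal.Length, Petal.Width) rows
n_rows <- nrow(unique(setosa[vars]))     # 28, matching n_distinct()

# Second argument is `incomparables`, so only Sepal.Length is deduplicated
n_two_args <- length(unique(setosa$Sepal.Length, setosa$Petal.Width))  # 15

# Same as counting distinct sepal lengths on their own
n_sepal_only <- length(unique(setosa$Sepal.Length))                    # 15
```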

Benchmark with a larger data frame (30,000 rows, 300 groups of 100 rows each):

b <- data.frame(group = gl(300, 100),
                var1 = rbinom(30000, 1, .5),
                var2 = rbinom(30000, 1, .5)) |>
  group_by(group)
vars <- c("var1", "var2")

bench::mark(baseR = mutate(b, adjusted_count = nrow(unique(across(all_of(vars))))),
            dplyr = mutate(b, adjusted_count = n_distinct(across(all_of(vars)))),
            data.table = mutate(b, adjusted_count = data.table::uniqueN(across(all_of(vars)))))

#  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 baseR       131ms  136ms      7.36    1.87MB     7.36     2     2      272ms
#2 dplyr       120ms  124ms      7.18    1.35MB     2.39     3     1      418ms
#3 data.table   88ms  109ms      9.19    1.12MB     4.60     2     1      218ms
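
Since the question is about a function that receives column names as strings, here is one way such a wrapper could look. This is only a sketch: the name add_adjusted_count and its arguments are made up for illustration. It uses the nrow(unique()) form, and data.table::uniqueN() could be swapped in at the same place. Newer dplyr versions (1.1.0 and later) also provide pick() for selecting columns without applying a function, which may be preferable to across() there.

```r
library(dplyr)

# Hypothetical helper: `group_var` and `vars` are character vectors of column names
add_adjusted_count <- function(data, group_var, vars) {
  data |>
    group_by(across(all_of(group_var))) |>
    mutate(
      raw_count = n(),
      # across() yields a data frame of the selected columns for the current
      # group, so nrow(unique()) counts the distinct row combinations
      adjusted_count = nrow(unique(across(all_of(vars))))
    ) |>
    ungroup()
}

res <- add_adjusted_count(iris, "Species", c("Sepal.Length", "Petal.Width"))

# One summary row for setosa: raw_count 50, adjusted_count 28
setosa_counts <- distinct(filter(res, Species == "setosa"), raw_count, adjusted_count)
```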