I'm writing a function in R that takes a character vector of column names and performs adjusted counting with n_distinct(), like so:
library(tidyverse)
> packageVersion("tidyverse")
[1] ‘1.3.2’
vars <- c("Sepal.Length", "Petal.Width")
foo <- iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = n_distinct(across(all_of(vars)))
  )
For the Species 'setosa', this results in a raw count of 50 and an adjusted count of 28.
However, I have a large dataset and this function is used in an R Shiny app, so I'm trying to optimise where possible.
I have read that length(unique()) is faster than n_distinct(), and I've seen some speedups in other functions. For this use, however, I have encountered two problems.
bar <- iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = length(unique(across(all_of(vars))))
  )
Replacing n_distinct() in this case results in length(unique()) counting the number of distinct strings in the vars vector (2), which is obviously not the desired result.
So I tested this using the actual variable names.
baz <- iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = length(unique(Sepal.Length, Petal.Width))
  )
For the Species 'setosa', this results in a raw count of 50 and an adjusted count of 15, and I am unsure why this produces a different result from n_distinct().
If anyone can explain the difference in results, and how to pass a character vector of column names to length(unique()), it would be greatly appreciated.
Answer:
If you're looking for a fast solution, you can try data.table::uniqueN:
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient.
iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = data.table::uniqueN(across(all_of(vars)))
  )
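The equivalence quoted above can be checked on a small example. This is only a sketch: it assumes the data.table package is installed, and the toy vector `x` and data frame `df` are made up for illustration.

```r
# Toy inputs (hypothetical, not from the question).
x  <- c(1, 1, 2, 3, 3)
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

# Atomic vector: uniqueN() counts distinct values.
n_vec <- data.table::uniqueN(x)

# Data frame: uniqueN() counts distinct rows.
n_rows <- data.table::uniqueN(df)

# Both should agree with the base R expressions named in the documentation.
stopifnot(n_vec == length(unique(x)))
stopifnot(n_rows == nrow(unique(df)))
```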
The documentation also tells you the base R equivalent of n_distinct in a data.frame is nrow(unique(x)), not length(unique(x)):
iris |>
  group_by(Species) |>
  mutate(
    raw_count = n(),
    adjusted_count = nrow(unique(across(all_of(vars))))
  )
This is because length applied to a dataframe counts the number of columns, not rows:
length(iris)
#[1] 5
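Both results from the question (2 and 15) can be reproduced in base R on the 'setosa' rows. The sketch below uses only the built-in iris data; the variable names are illustrative. The second part rests on the fact that the second positional argument of unique() is `incomparables`, so unique(Sepal.Length, Petal.Width) de-duplicates Sepal.Length alone rather than the pair of columns.

```r
# Base R only; iris ships with R.
setosa <- iris[iris$Species == "setosa", ]
vars   <- c("Sepal.Length", "Petal.Width")
sub    <- setosa[, vars]

# length() of a data frame is its number of columns, so this is 2.
n_cols <- length(unique(sub))

# nrow() counts the distinct (Sepal.Length, Petal.Width) combinations,
# which is 28 here, the same count n_distinct() gave in the question.
n_rows <- nrow(unique(sub))

# unique(x, y) passes y as `incomparables`; the columns are not combined.
# No Petal.Width value occurs in Sepal.Length for setosa, so this simply
# counts the distinct Sepal.Length values: 15.
n_two_args <- length(unique(setosa$Sepal.Length, setosa$Petal.Width))
n_one_arg  <- length(unique(setosa$Sepal.Length))
```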
Benchmark with a large dataframe (30,000 rows, 300 groups of 100 rows each):
b <- data.frame(group = gl(300, 100),
                var1 = rbinom(30000, 1, .5),
                var2 = rbinom(30000, 1, .5)) |>
  group_by(group)
vars <- c("var1", "var2")
bench::mark(
  baseR = mutate(b, adjusted_count = nrow(unique(across(all_of(vars))))),
  dplyr = mutate(b, adjusted_count = n_distinct(across(all_of(vars)))),
  data.table = mutate(b, adjusted_count = data.table::uniqueN(across(all_of(vars))))
)
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 baseR         131ms    136ms      7.36    1.87MB     7.36     2     2      272ms
# 2 dplyr         120ms    124ms      7.18    1.35MB     2.39     3     1      418ms
# 3 data.table     88ms    109ms      9.19    1.12MB     4.60     2     1      218ms
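Since the question is about a function that receives the column names as strings, here is a minimal sketch of how that could be wrapped up. The helper name `add_adjusted_count` and its arguments `group_var` and `vars` are hypothetical, and the base R counter nrow(unique()) is used only to avoid an extra dependency; data.table::uniqueN() could be swapped in at the same place.

```r
library(dplyr)

# Hypothetical helper: `group_var` and `vars` are character vectors of
# column names, as in the question.
add_adjusted_count <- function(data, group_var, vars) {
  data |>
    group_by(across(all_of(group_var))) |>
    mutate(
      raw_count = n(),
      # across() returns a data frame of the selected columns for the
      # current group, so nrow(unique()) counts distinct combinations.
      adjusted_count = nrow(unique(across(all_of(vars))))
    ) |>
    ungroup()
}

res <- add_adjusted_count(iris, "Species", c("Sepal.Length", "Petal.Width"))
```

For 'setosa' this should give the same raw count of 50 and adjusted count of 28 reported in the question.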