I have a data frame df of around 10 million employees. Each employee has an ID, and there is a city variable and a company variable that show where they work:
ID  city  company
 1  NYC   ABC
 2  BOS   ABC
 1  NYC   DEF
 3  SEA   GHI
I want to group_by ID and city and then figure out the number of companies each employee works for:
ID  city  count
 1  NYC   2
 2  BOS   1
 3  SEA   1
My code is df %>% group_by(ID, city) %>% summarize(count = n_distinct(company))
However, this is taking a long time to run. My set-up typically handles computationally heavy operations, but I am having a hard time running this chunk of code. Does anyone have a different (ideally faster) implementation?
Edit: n_distinct(company) instead of n_distinct(ID)
CodePudding user response:
Here is a benchmark comparison of a data.table and a dplyr solution using a larger (and hopefully representative?) dataset.
library(dplyr)
library(data.table)
library(microbenchmark)
library(ggplot2)   # needed for autoplot()

# Sample data
df <- data.frame(
  ID = sample(1:10, 10000, replace = TRUE),
  city = sample(c("NYC", "BOS", "SEA"), 10000, replace = TRUE),
  company = sample(c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"), 10000, replace = TRUE))

dt <- as.data.table(df)

res <- microbenchmark(
  dplyr = df %>%
    group_by(ID, city) %>%
    summarise(count = n_distinct(company), .groups = "drop"),
  datatable = dt[, .(count = uniqueN(company)), by = c("ID", "city")]
)

res
#Unit: milliseconds
#      expr      min       lq      mean   median        uq      max neval
#     dplyr 6.843204 7.696959 10.110256 9.915225 10.906205 41.35438   100
# datatable 1.893994 2.255023  2.924953 2.738450  3.395504  7.61165   100

autoplot(res)
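If the real df from the question (around 10 million rows) is already in memory, a minimal sketch of applying the same data.table idea directly to it could look like the following (my addition, assuming the ID, city, and company columns from the question; setDT() converts in place so the big table is not copied, and keyby also sorts the result by ID and city):

# Sketch only: the same uniqueN() aggregation on the full data frame
library(data.table)

setDT(df)   # convert in place, avoids copying ~10M rows
result <- df[, .(count = uniqueN(company)), keyby = .(ID, city)]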
CodePudding user response:
We have another option. We can use the collapse package, a C/C++-based package for data manipulation. The syntax is similar to dplyr. In the following example, fgroup_by, fsummarise, and fndistinct are from the collapse package and perform the same tasks as group_by, summarise, and n_distinct from dplyr.
library(dplyr)
library(collapse)

dat <- read.table(textConnection("ID city company
1 NYC ABC
2 BOS ABC
1 NYC DEF
3 SEA GHI"), header = TRUE)

dat %>%
  fgroup_by(ID, city) %>%
  fsummarise(count = fndistinct(company))
#   ID city count
# 1  1  NYC     2
# 2  2  BOS     1
# 3  3  SEA     1
Below is a speed comparison. I modified the examples and code from @Maurits Evers.
library(dplyr)
library(data.table)
library(collapse)
library(ggplot2)
library(microbenchmark)

# Sample data
df <- data.frame(
  ID = sample(1:10, 10000, replace = TRUE),
  city = sample(c("NYC", "BOS", "SEA"), 10000, replace = TRUE),
  company = sample(c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"), 10000, replace = TRUE))

dt <- as.data.table(df)

res <- microbenchmark(
  dplyr = df %>%
    group_by(ID, city) %>%
    summarise(count = n_distinct(company), .groups = "drop"),
  datatable = dt[, .(count = uniqueN(company)), by = c("ID", "city")],
  collapse = df %>%
    fgroup_by(ID, city) %>%
    fsummarise(count = fndistinct(company))
)
res
# Unit: microseconds
#       expr      min       lq     mean   median        uq       max neval cld
#      dplyr 4885.501 5776.400 6658.784 6108.901 7073.8015 16177.301   100   c
#  datatable 2705.501 3173.351 3500.120 3332.501 3643.5510  9076.502   100  b
#   collapse  283.001  347.251  482.880  402.901  452.2505  7740.401   100 a
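As a quick sanity check (a sketch of my own, not part of the benchmark), the three approaches can be compared on the sample data above to confirm they produce the same counts; this assumes df, dt, and the packages loaded in the previous chunk:

out_dplyr    <- df %>%
  group_by(ID, city) %>%
  summarise(count = n_distinct(company), .groups = "drop")
out_dt       <- dt[, .(count = uniqueN(company)), keyby = .(ID, city)]
out_collapse <- df %>%
  fgroup_by(ID, city) %>%
  fsummarise(count = fndistinct(company))

# All three should return the counts in the same (ID, city) order
all.equal(out_dplyr$count, out_dt$count)        # expected TRUE
all.equal(out_dplyr$count, out_collapse$count)  # expected TRUE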