data.table average by group provide different results when using mean() and sum()/.N


When calculating the average by group in a data.table I get distinct results:

qty <- c(1:6)
name <- c("a", "b","a", "a", "c","b" )
type <- c("i", "i", "i", "f", "f", "f")

DT <- data.table(qty,name,type) 

DT[, avg_mean  := mean(qty)   , by = .(name, type)]
DT[, avg_sum_N := sum(qty)/.N , by = .(name, type)]

 > DT
     qty   name   type avg_mean avg_sum_N
   <int> <char> <char>    <num>     <num>
1:     1      a      i        2         2
2:     2      b      i        4         2
3:     3      a      i        2         2
4:     4      a      f        2         4
5:     5      c      f        6         5
6:     6      b      f        5         6

I would expect avg_mean and avg_sum_N to be exactly the same (avg_sum_N shows the values I expect). Why are they different? Thank you.
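As a cross-check (not in the original post), base R's ave() computes the same group mean without any data.table optimization, so it shows what both columns should contain:

```r
# Cross-check with base R: ave() groups by name and type and applies mean(),
# bypassing data.table entirely.
qty  <- 1:6
name <- c("a", "b", "a", "a", "c", "b")
type <- c("i", "i", "i", "f", "f", "f")

ave(qty, name, type, FUN = mean)
# 2 2 2 4 5 6  -- matches avg_sum_N above, not avg_mean
```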

Please find below session info.

> packageVersion('data.table')
[1] ‘1.14.3’
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] zoo_1.8-10        lubridate_1.8.0   RPostgres_1.4.3   DBI_1.1.2         stringi_1.7.6     readxl_1.4.0     
 [7] gsubfn_0.7        proto_1.0.0       stringr_1.4.0     magrittr_2.0.3    stringdist_0.9.8  fuzzyjoin_0.1.6  
[13] data.table_1.14.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     pillar_1.7.0     compiler_4.1.0   cellranger_1.1.0 tools_4.1.0      bit_4.0.4       
 [7] lattice_0.20-44  lifecycle_1.0.1  tibble_3.1.6     pkgconfig_2.0.3  rlang_1.0.2      cli_3.2.0       
[13] rstudioapi_0.13  writexl_1.4.0    parallel_4.1.0   dplyr_1.0.8      hms_1.1.1        generics_0.1.2  
[19] vctrs_0.4.1      grid_4.1.0       bit64_4.0.5      tidyselect_1.1.2 glue_1.6.2       R6_2.5.1        
[25] fansi_1.0.3      tcltk_4.1.0      blob_1.2.3       purrr_0.3.4      ellipsis_0.3.2   assertthat_0.2.1
[31] utf8_1.2.2       crayon_1.5.1

CodePudding user response:

This is related to the GForce optimization, which replaces sum and mean with data.table's internal gsum and gmean. It can be disabled by lowering the optimization level:

options(datatable.optimize=1)
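Note that this option is global: GForce is optimization level 2 and the default level is Inf, so lowering it affects every subsequent data.table call. A sketch of toggling it temporarily:

```r
# Disable GForce (level 2) by capping optimization at level 1,
# run the grouped assignment un-optimized, then restore the default.
old <- options(datatable.optimize = 1L)
DT[, avg_mean := mean(qty), by = .(name, type)]  # evaluated without gmean
options(old)                                     # back to the default (Inf)
```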

Alternatively, call base::mean and base::sum explicitly, which prevents GForce from being applied:

DT[, avg_mean  := base::mean(qty)   , by = .(name, type)]
DT[, avg_sum_N := base::sum(qty)/.N , by = .(name, type)]

The optimization can be seen with verbose = TRUE:

> DT[, avg_mean  := mean(qty)   , by = .(name, type), verbose = TRUE]
Argument 'by' after substitute: .(name, type)
Detected that j uses these columns: [avg_mean, qty]
Finding groups using forderv ... forder.c received 6 rows and 2 columns
0.001s elapsed (0.001s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.000s cpu) 
Getting back original order ... forder.c received a vector type 'integer' length 5
0.001s elapsed (0.001s cpu) 
lapply optimization is on, j unchanged as 'mean(qty)'
GForce optimized j to 'gmean(qty)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gmean took (narm=FALSE) ... gather took ... 0.000s
0.000s
gforce eval took 0.000
0.001s elapsed (0.002s cpu) 
Assigning to 6 row subset of 6 rows
RHS_list_of_columns == false
> DT[, avg_mean  := base::mean(qty)   , by = .(name, type), verbose = TRUE]
Argument 'by' after substitute: .(name, type)
Detected that j uses these columns: [avg_mean, qty]
Finding groups using forderv ... forder.c received 6 rows and 2 columns
0.002s elapsed (0.001s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu) 
Getting back original order ... forder.c received a vector type 'integer' length 5
0.001s elapsed (0.001s cpu) 
lapply optimization is on, j unchanged as 'base::mean(qty)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
  collecting discontiguous groups took 0.000s for 5 groups
  eval(j) took 0.000s for 5 calls
0.001s elapsed (0.001s cpu) 

If we want to keep the GForce optimization, the table needs to be physically ordered by the grouping columns first, e.g. with setkey:

setkey(DT, name, type)
DT[, avg_sum_N := sum(qty)/.N , by = .(name, type)]
DT[, avg_mean  := mean(qty)   , by = .(name, type)]
> DT
Key: <name, type>
     qty   name   type avg_sum_N avg_mean
   <int> <char> <char>     <num>    <num>
1:     4      a      f         4        4
2:     1      a      i         2        2
3:     3      a      i         2        2
4:     6      b      f         6        6
5:     2      b      i         2        2
6:     5      c      f         5        5
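If marking a key is not wanted, setorder() should work the same way here (my assumption, since it also physically reorders the rows by the grouping columns, which is what the workaround relies on):

```r
# Physically sort by the grouping columns without setting a key,
# then run the grouped assignments as before.
setorder(DT, name, type)
DT[, avg_mean := mean(qty), by = .(name, type)]
```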

CodePudding user response:

The problem was a bug in the development version of data.table (1.14.3 is a dev build, not a CRAN release). Updating with data.table::update.dev.pkg() fixed it.
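For reference, either pulling the latest development build or falling back to the released CRAN version resolves it (the CRAN fallback is my suggestion, not from the original answer):

```r
# Update to the latest data.table development build ...
data.table::update.dev.pkg()

# ... or revert to the released CRAN version instead.
install.packages("data.table")
```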
