When calculating the average by group in a data.table I get distinct results:
qty <- c(1:6)
name <- c("a", "b","a", "a", "c","b" )
type <- c("i", "i", "i", "f", "f", "f")
DT <- data.table(qty,name,type)
DT[, avg_mean := mean(qty) , by = .(name, type)]
DT[, avg_sum_N := sum(qty)/.N , by = .(name, type)]
> DT
qty name type avg_mean avg_sum_N
<int> <char> <char> <num> <num>
1: 1 a i 2 2
2: 2 b i 4 2
3: 3 a i 2 2
4: 4 a f 2 4
5: 5 c f 6 5
6: 6 b f 5 6
I would expect that avg_mean
and avg_sum_N
would be exactly the same, such as avg_sum_N
.
Why are they different? Thank you.
Please find below session info.
> packageVersion('data.table')
[1] ‘1.14.3’
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.8-10 lubridate_1.8.0 RPostgres_1.4.3 DBI_1.1.2 stringi_1.7.6 readxl_1.4.0
[7] gsubfn_0.7 proto_1.0.0 stringr_1.4.0 magrittr_2.0.3 stringdist_0.9.8 fuzzyjoin_0.1.6
[13] data.table_1.14.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 pillar_1.7.0 compiler_4.1.0 cellranger_1.1.0 tools_4.1.0 bit_4.0.4
[7] lattice_0.20-44 lifecycle_1.0.1 tibble_3.1.6 pkgconfig_2.0.3 rlang_1.0.2 cli_3.2.0
[13] rstudioapi_0.13 writexl_1.4.0 parallel_4.1.0 dplyr_1.0.8 hms_1.1.1 generics_0.1.2
[19] vctrs_0.4.1 grid_4.1.0 bit64_4.0.5 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
[25] fansi_1.0.3 tcltk_4.1.0 blob_1.2.3 purrr_0.3.4 ellipsis_0.3.2 assertthat_0.2.1
[31] utf8_1.2.2 crayon_1.5.1
CodePudding user response:
It is related to the GForce optimization for sum
and mean
which use gsum
and gmean
. It can be either set as FALSE with options
options(datatable.optimize=1)
Or may specifically use base::mean
, base::sum
DT[, avg_mean := base::mean(qty) , by = .(name, type)]
DT[, avg_sum_N := base::sum(qty)/.N , by = .(name, type)]
It would be revealed with verbose
> DT[, avg_mean := mean(qty) , by = .(name, type), verbose = TRUE]
Argument 'by' after substitute: .(name, type)
Detected that j uses these columns: [avg_mean, qty]
Finding groups using forderv ... forder.c received 6 rows and 2 columns
0.001s elapsed (0.001s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 5
0.001s elapsed (0.001s cpu)
lapply optimization is on, j unchanged as 'mean(qty)'
GForce optimized j to 'gmean(qty)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
gforce assign high and low took 0.000
This gmean took (narm=FALSE) ... gather took ... 0.000s
0.000s
gforce eval took 0.000
0.001s elapsed (0.002s cpu)
Assigning to 6 row subset of 6 rows
RHS_list_of_columns == false
> DT[, avg_mean := base::mean(qty) , by = .(name, type), verbose = TRUE]
Argument 'by' after substitute: .(name, type)
Detected that j uses these columns: [avg_mean, qty]
Finding groups using forderv ... forder.c received 6 rows and 2 columns
0.002s elapsed (0.001s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 5
0.001s elapsed (0.001s cpu)
lapply optimization is on, j unchanged as 'base::mean(qty)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.000s for 5 groups
eval(j) took 0.000s for 5 calls
0.001s elapsed (0.001s cpu)
If we want to make use of the GForce optimization, then we may need to order with the by
columns first i.e. can use setkey
setkey(DT, name, type)
DT[, avg_sum_N := sum(qty)/.N , by = .(name, type)]
DT[, avg_mean := mean(qty) , by = .(name, type)]
> DT
Key: <name, type>
qty name type avg_sum_N avg_mean
<int> <char> <char> <num> <num>
1: 4 a f 4 4
2: 1 a i 2 2
3: 3 a i 2 2
4: 6 b f 6 6
5: 2 b i 2 2
6: 5 c f 5 5
CodePudding user response:
The problem was a bug in dev data.table version.
data.table::update.dev.pkg()
fixed the problem.