I will start by describing the task that I am performing. I have to repeatedly calculate a grouped sum, typically between 5 and 10 times. The values in the column that I am performing the grouped sum on change with each iteration, but the columns that I am grouping by do not. Below is an example table, in which w, x, and y together constitute the grouping and z is the column whose values will be summed.
DT <- data.table(w = sample(1:10, size = 1000000, replace = TRUE),
x = sample(1:100, size = 1000000, replace = TRUE),
y = sample(1:1000, size = 1000000, replace = TRUE),
z = runif(n = 1000000, min = 0, max = 1))
setkey(DT, w, x, y)
My initial solution was, I think, the most obvious:
DT[, Group_Total := sum(z), keyby = list(w, x, y)]
microbenchmark(DT[, Group_Total := sum(z), keyby = list(w, x, y)])
Unit: milliseconds
      min       lq     mean   median       uq max neval
 447.8674 467.3366 481.1989 479.3057 490.5499 691   100
I found that this solution was not nearly as fast as I hoped, especially since I had to perform this calculation several times. However, as far as I could tell, there was no method within data.table that would provide better performance. Recently, I began experimenting with Rcpp, and I wrote the function below to perform a grouped sum:
cppFunction('NumericVector Group_Total(NumericVector Grouping, NumericVector x) {
  int n = Grouping.size();
  NumericVector out(n);
  int curr_start_index = 0;   // first row of the current group
  double grp_total = x[0];    // running total for the current group
  for(int i = 1; i < (n-1); i++) {
    if(Grouping[i] == 1) {
      // A new group starts here: write the finished total back to
      // every row of the previous group.
      for(int j = curr_start_index; j < i; j++) {
        out[j] = grp_total;
      }
      curr_start_index = i;
      grp_total = x[i];
    } else {
      grp_total = grp_total + x[i];
    }
  }
  // The last element is handled separately.
  if(Grouping[(n-1)] == 1) {
    for(int j = curr_start_index; j < (n-1); j++) {
      out[j] = grp_total;
    }
    out[(n-1)] = x[(n-1)];
  } else {
    grp_total = grp_total + x[(n-1)];
    for(int j = curr_start_index; j < n; j++) {
      out[j] = grp_total;
    }
  }
  return out;
}')
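The logic is easier to test outside R. Below is a minimal plain-C++ sketch of the same single-pass grouped sum over a sorted input (a standalone illustration, not the Rcpp version itself; the function and variable names are my own):

```cpp
#include <vector>

// Single-pass grouped sum over a sorted vector. grouping[i] == 1 marks
// the first row of a group (a within-group index of 1). Every output
// element receives the total of the group it belongs to.
std::vector<double> grouped_sum(const std::vector<int>& grouping,
                                const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> out(n);
    std::size_t start = 0;   // first index of the current group
    double total = 0.0;
    for (std::size_t i = 0; i <= n; ++i) {
        // A new group starts (or the input ends): flush the running total
        // back to every row of the group just finished.
        if (i == n || (i > start && grouping[i] == 1)) {
            for (std::size_t j = start; j < i; ++j) out[j] = total;
            start = i;
            total = 0.0;
        }
        if (i < n) total += x[i];
    }
    return out;
}
```

The key property is the same as in the Rcpp version: one allocation for the output and a single scan over the sorted input, with no per-group allocations.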
To use this function, I calculate a within group index column once the data is ordered:
DT[, Within_Group_Index := 1][, Within_Group_Index := cumsum(Within_Group_Index), keyby = list(w, x, y)]
Since I only need to calculate this column once, the time added is not significant. Once this column is created, the grouped sum can be calculated via:
DT[, Group_Total := Group_Total(Within_Group_Index, z)]
microbenchmark(DT[, Group_Total := Group_Total(Within_Group_Index, z)])
Unit: milliseconds
    min     lq     mean median     uq      max neval
 5.3286 6.9879 9.078267 7.4814 8.0171 161.6406   100
This is faster by a significant margin, and now we come to my question: why? My understanding is that grouped operations are very efficient in data.table. Is there something that I can do to speed up the data.table grouped operation, or is this an example of a fundamental speed vs. flexibility tradeoff, i.e. the flexibility of data.table's grouped operations comes at the cost of speed when considering specific use cases?
CodePudding user response:
I think your suspicion is correct. As to why, we could make an educated guess.

The C++ code you wrote doesn't modify anything in its input data; it only has to allocate out. As you correctly said, data.table aims to be more flexible, and must support any valid R code the users provide. That makes me think that, for grouped operations, it would have to allocate new R vectors for each subset of z according to the group indices, effectively copying some of the input multiple times.
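To make that hypothesized cost concrete, here is a toy model in plain C++ (an assumption about the shape of the work, not data.table's actual internals): summing groups by materializing each subset into a fresh vector first, versus a single streaming pass.

```cpp
#include <numeric>
#include <vector>

// Toy model (assumption, not data.table internals): sum each group by
// first copying its rows into a fresh vector, the way an evaluator that
// must hand `j` a real R vector per group would. `starts` holds the
// group boundaries, including n as an end sentinel.
double sum_with_copies(const std::vector<double>& x,
                       const std::vector<std::size_t>& starts) {
    double check = 0.0;
    for (std::size_t g = 0; g + 1 < starts.size(); ++g) {
        // One fresh allocation + copy per group.
        std::vector<double> subset(x.begin() + starts[g],
                                   x.begin() + starts[g + 1]);
        check += std::accumulate(subset.begin(), subset.end(), 0.0);
    }
    return check;
}

// Streaming pass: no per-group allocation at all.
double sum_streaming(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0);
}
```

Both return the same totals; the difference is purely the allocator traffic of the subset copies, which is exactly the overhead a single-pass function avoids.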
I wanted to try to validate my assumption, so I wrote some C code to look at memory addresses:
set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
x = sample(1:3, size = 20, replace = TRUE),
y = sample(1:3, size = 20, replace = TRUE),
z = runif(n = 20, min = 0, max = 1))
setkey(DT, w, x, y)
cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
Rcout << "input length = " << z.size() << "\\n";
StringVector ans(indices.size());
for (auto i = 0; i < indices.size(); i++) {
std::ostringstream oss;
oss << "Group" << group_ids[i] << " " << &z[indices[i]];
ans[i] = oss.str();
}
return ans;
}
')
idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]
DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290
As expected here, if we look at z as a whole, no copy takes place and the addresses of the different elements are close to each other, for example 0x55b773336200 = 0x55b7733361f8 + 0x8. You can execute the last line from the previous code multiple times and it will always show the same addresses.
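That 0x8 spacing is just sizeof(double): the elements of one numeric vector live in a single contiguous allocation, so adjacent elements sit 8 bytes apart, and the printed group starts differ by small multiples of that. A few lines of C++ confirm the property for any contiguous array (an illustration outside R):

```cpp
#include <cstdint>
#include <vector>

// In a single contiguous allocation, element i+1 sits exactly
// sizeof(double) == 8 bytes after element i.
bool adjacent(const std::vector<double>& v, std::size_t i) {
    auto a = reinterpret_cast<std::uintptr_t>(&v[i]);
    auto b = reinterpret_cast<std::uintptr_t>(&v[i + 1]);
    return b - a == sizeof(double);
}
```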
What I partially didn't expect is this:
DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440
On the one hand, the address in memory did change, so something was copied, and if you run this multiple times you will see different addresses each time. However, it seems like data.table somehow reuses a buffer with the same address, maybe allocating one array with the length of the biggest group-subset and copying the different group values into it? I wonder how they manage that. Or maybe my code is wrong ¯\_(ツ)_/¯
EDIT: I added the following as the first line of addrs (double escape because of R parsing):

Rcout << "input length = " << z.size() << "\\n";

and if I run the last DT[...] code from above, it does print different lengths even though the address is the same.
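That pattern — a stable address with a varying reported length — is consistent with a reusable scratch buffer. Here is a hypothetical sketch of such a mechanism (my guess at the behavior, not data.table's actual code; the type and member names are invented):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical reusable group buffer: allocated once and overwritten
// for each group, growing only when a larger group arrives. The pointer
// handed out stays the same across loads that fit, while the visible
// length changes -- matching the observation above.
struct GroupBuffer {
    std::vector<double> buf;
    std::size_t len = 0;  // length of the current group's data

    void load(const double* src, std::size_t n) {
        if (n > buf.size()) buf.resize(n);  // grow only when needed
        std::copy(src, src + n, buf.begin());
        len = n;
    }
    const double* data() const { return buf.data(); }
};
```

Growing only when a bigger group arrives means the address stays stable across most groups, which would explain seeing one address with many different lengths.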
CodePudding user response:
I found that the initial code was faster on dev branch version 1.14.3. From the NEWS:
- := is now optimized by group
I saw a similar timing to yours, but after upgrading I see that the original code is pretty fast:
microbenchmark(
direct = DT[, Group_Total := sum(z), keyby = list(w, x, y)],
indirect = DT[, idx := rowid(w,x,y)][, gt := Group_Total(idx, z)]
)
Unit: milliseconds
expr min lq mean median uq max neval
direct 42.92247 44.61097 48.77803 45.48076 46.93751 72.29237 100
indirect 30.49422 31.95343 34.69138 33.18616 35.32934 58.37454 100
Notes:

- rowid constructs the within-group index.
- DT[, Group_Total := sum(z), keyby = list(w, x, y), verbose=TRUE] will display info on whether GForce optimization is being used.