I will start by describing the task that I am performing. I have to repeatedly calculate a grouped sum, typically between 5 and 10 times. The values in the column that I am performing the grouped sum on change with each iteration, but the columns that I am grouping by do not. Below is an example table, in which w, x, and y together constitute the grouping and z is the column whose values will be summed.
DT <- data.table(w = sample(1:10, size = 1000000, replace = TRUE),
x = sample(1:100, size = 1000000, replace = TRUE),
y = sample(1:1000, size = 1000000, replace = TRUE),
z = runif(n = 1000000, min = 0, max = 1))
setkey(DT, w, x, y)
My initial solution was, I think, the most obvious:
DT[, Group_Total := sum(z), keyby = list(w, x, y)]
microbenchmark(DT[, Group_Total := sum(z), keyby = list(w, x, y)])
Unit: milliseconds
      min       lq     mean   median       uq max neval
 447.8674 467.3366 481.1989 479.3057 490.5499 691   100
I found that this solution was not nearly as fast as I hoped, especially since I had to perform this calculation several times. However, as far as I could tell, there was no method within data.table that would provide better performance. Recently, I began experimenting with Rcpp, and I wrote the function below to perform a grouped sum:
cppFunction('NumericVector Group_Total(NumericVector Grouping, NumericVector x) {
  int n = Grouping.size();
  NumericVector out(n);
  int curr_start_index = 0;   // first row of the current group
  double grp_total = x[0];    // running total for the current group
  for(int i = 1; i < (n-1); i++) {
    if(Grouping[i] == 1) {
      // A new group starts here: write the finished total back to
      // every row of the previous group.
      for(int j = curr_start_index; j < i; j++) {
        out[j] = grp_total;
      }
      curr_start_index = i;
      grp_total = x[i];
    } else {
      grp_total = grp_total + x[i];
    }
  }
  // The last element is handled separately.
  if(Grouping[(n-1)] == 1) {
    for(int j = curr_start_index; j < (n-1); j++) {
      out[j] = grp_total;
    }
    out[(n-1)] = x[(n-1)];
  } else {
    grp_total = grp_total + x[(n-1)];
    for(int j = curr_start_index; j < n; j++) {
      out[j] = grp_total;
    }
  }
  return out;
}')
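The logic is easier to test outside R. Below is a minimal plain-C++ sketch of the same single-pass grouped sum over a sorted input (a standalone illustration, not the Rcpp version itself; the function and variable names are my own):

```cpp
#include <vector>

// Single-pass grouped sum over a sorted vector. grouping[i] == 1 marks
// the first row of a group (a within-group index of 1). Every output
// element receives the total of the group it belongs to.
std::vector<double> grouped_sum(const std::vector<int>& grouping,
                                const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> out(n);
    std::size_t start = 0;   // first index of the current group
    double total = 0.0;
    for (std::size_t i = 0; i <= n; ++i) {
        // A new group starts (or the input ends): flush the running total
        // back to every row of the group just finished.
        if (i == n || (i > start && grouping[i] == 1)) {
            for (std::size_t j = start; j < i; ++j) out[j] = total;
            start = i;
            total = 0.0;
        }
        if (i < n) total += x[i];
    }
    return out;
}
```

The key property is the same as in the Rcpp version: one allocation for the output and a single scan over the sorted input, with no per-group allocations.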
To use this function, I calculate a within group index column once the data is ordered:
DT[, Within_Group_Index := 1][, Within_Group_Index := cumsum(Within_Group_Index), keyby = list(w, x, y)]
Since I only need to calculate this column once, the time added is not significant. Once this column is created, the grouped sum can be calculated via:
DT[, Group_Total := Group_Total(Within_Group_Index, z)]
microbenchmark(DT[, Group_Total := Group_Total(Within_Group_Index, z)])
Unit: milliseconds
    min     lq     mean median     uq      max neval
 5.3286 6.9879 9.078267 7.4814 8.0171 161.6406   100
This is faster by a significant margin, and now we come to my question: why? My understanding is that grouped operations are very efficient in data.table. Is there something that I can do to speed up the data.table grouped operation, or is this an example of a fundamental speed vs. flexibility tradeoff, i.e. the flexibility of data.table's grouped operations comes at the cost of speed when considering specific use cases?
CodePudding user response:
I think your suspicion is correct. As to why, we could make an educated guess.

The C++ code you wrote doesn't modify anything in its input data; it only has to allocate out. As you correctly said, data.table aims to be more flexible, and must support any valid R code the users provide. That makes me think that, for grouped operations, it would have to allocate new R vectors for each subset of z according to the group indices, effectively copying some of the input multiple times.
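To make that hypothesized cost concrete, here is a toy model in plain C++ (an assumption about the shape of the work, not data.table's actual internals): summing groups by materializing each subset into a fresh vector first, versus a single streaming pass.

```cpp
#include <numeric>
#include <vector>

// Toy model (assumption, not data.table internals): sum each group by
// first copying its rows into a fresh vector, the way an evaluator that
// must hand `j` a real R vector per group would. `starts` holds the
// group boundaries, including n as an end sentinel.
double sum_with_copies(const std::vector<double>& x,
                       const std::vector<std::size_t>& starts) {
    double check = 0.0;
    for (std::size_t g = 0; g + 1 < starts.size(); ++g) {
        // One fresh allocation + copy per group.
        std::vector<double> subset(x.begin() + starts[g],
                                   x.begin() + starts[g + 1]);
        check += std::accumulate(subset.begin(), subset.end(), 0.0);
    }
    return check;
}

// Streaming pass: no per-group allocation at all.
double sum_streaming(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0);
}
```

Both return the same totals; the difference is purely the allocator traffic of the subset copies, which is exactly the overhead a single-pass function avoids.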
I wanted to try to validate my assumption, so I wrote some C code to look at memory addresses:
set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
x = sample(1:3, size = 20, replace = TRUE),
y = sample(1:3, size = 20, replace = TRUE),
z = runif(n = 20, min = 0, max = 1))
setkey(DT, w, x, y)
cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
Rcout << "input length = " << z.size() << "\\n";
StringVector ans(indices.size());
for (auto i = 0; i < indices.size(); i++) {
std::ostringstream oss;
oss << "Group" << group_ids[i] << " " << &z[indices[i]];
ans[i] = oss.str();
}
return ans;
}
')
idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]
DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290
As expected here, if we look at z as a whole, no copy takes place and the addresses of the different elements are close to each other, for example 0x55b773336200 = 0x55b7733361f8 + 0x8. You can execute the last line from the previous code multiple times and it will always show the same addresses.
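That 0x8 spacing is just sizeof(double): the elements of one numeric vector live in a single contiguous allocation, so adjacent elements sit 8 bytes apart, and the printed group starts differ by small multiples of that. A few lines of C++ confirm the property for any contiguous array (an illustration outside R):

```cpp
#include <cstdint>
#include <vector>

// In a single contiguous allocation, element i+1 sits exactly
// sizeof(double) == 8 bytes after element i.
bool adjacent(const std::vector<double>& v, std::size_t i) {
    auto a = reinterpret_cast<std::uintptr_t>(&v[i]);
    auto b = reinterpret_cast<std::uintptr_t>(&v[i + 1]);
    return b - a == sizeof(double);
}
```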
What I partially didn't expect is this:
DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440
On the one hand, the address in memory did change, so something was copied, and if you run this multiple times you will see different addresses each time. However, it seems like data.table somehow reuses a buffer with the same address, maybe allocating one array with the length of the biggest group-subset and copying the different group values into it? I wonder how they manage that. Or maybe my code is wrong ¯\_(ツ)_/¯
EDIT: I added the following as the first line of addrs (double escape because of R parsing):

Rcout << "input length = " << z.size() << "\\n";

and if I run the last DT[...] code from above, it does print different lengths even though the address is the same.
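That pattern — a stable address with a varying reported length — is consistent with a reusable scratch buffer. Here is a hypothetical sketch of such a mechanism (my guess at the behavior, not data.table's actual code; the type and member names are invented):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical reusable group buffer: allocated once and overwritten
// for each group, growing only when a larger group arrives. The pointer
// handed out stays the same across loads that fit, while the visible
// length changes -- matching the observation above.
struct GroupBuffer {
    std::vector<double> buf;
    std::size_t len = 0;  // length of the current group's data

    void load(const double* src, std::size_t n) {
        if (n > buf.size()) buf.resize(n);  // grow only when needed
        std::copy(src, src + n, buf.begin());
        len = n;
    }
    const double* data() const { return buf.data(); }
};
```

Growing only when a bigger group arrives means the address stays stable across most groups, which would explain seeing one address with many different lengths.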
CodePudding user response:
I found that the initial code was faster on dev branch version 1.14.3. From the NEWS:
- := is now optimized by group
I saw a similar timing to yours, but after upgrading I see that the original code is pretty fast:
microbenchmark(
direct = DT[, Group_Total := sum(z), keyby = list(w, x, y)],
indirect = DT[, idx := rowid(w,x,y)][, gt := Group_Total(idx, z)]
)
Unit: milliseconds
expr min lq mean median uq max neval
direct 42.92247 44.61097 48.77803 45.48076 46.93751 72.29237 100
indirect 30.49422 31.95343 34.69138 33.18616 35.32934 58.37454 100
Notes:

- rowid constructs the within-group index.
- DT[, Group_Total := sum(z), keyby = list(w, x, y), verbose=TRUE] will display info on whether GForce optimization is being used.