I have a function that uses data.table syntax. The function works but is way too slow. I know there has to be a better way to do this. Here is my function and some setup for reproducibility:
library(tidyverse)   # starwars data comes from dplyr (loaded via tidyverse)
library(data.table)

sw <- data.table(starwars)

# Large data set to make the difference noticeable
# (doubling 15 times gives ~2.85 million rows)
for (i in 1:15) {
  sw <- rbind(sw, sw)
}

# My working but slow function
dt.fun <- function(expr) {
  sw[, .("mean_bin" = mean(eval(parse(text = expr)),
                           na.rm = T)),
     by = species]
}
The function takes an expression and calculates the mean of that expression within each species. The function must be able to take any expression that a regular user could type in the command window, i.e. it must be able to calculate all of the following:
dt.fun("height > 150")
dt.fun("hair_color %in% c('brown', 'blond')")
dt.fun("mass")
The problem is that my function takes too long and is clearly not leveraging data.table's power:
# Using dt.fun()
tictoc::tic()
dt.fun("height > 150")
tictoc::toc()
# 2.23 sec elapsed
# Inputting the expression myself
tictoc::tic()
sw[, .("mean_bin" = mean(height > 150, na.rm = T)),
by = species]
tictoc::toc()
# 0.28 sec elapsed
My questions are:
- How can I rewrite my function so that it's faster?
- What other improvements would you make?
Thanks for your help!
CodePudding user response:
I can replicate your finding that dt.fun is roughly 1/10 as fast as a hardcoded expression. I don't have your sw data; my sample data has 1.6 million rows, and data.table is using 32 cores.
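For reference, here is a minimal sketch of comparable sample data. The column names duration and group are the ones used in the timings below; the row count matches, but the distribution and number of groups are assumptions, not the data I actually used:
# Hypothetical stand-in data: 1.6 million rows, a numeric 'duration'
# column and a 'group' column; distributions and group count assumed.
library(data.table)
set.seed(1)
n <- 1.6e6
sw <- data.table(
  duration = rexp(n, rate = 1/60),
  group    = sample(letters, n, replace = TRUE)
)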
tic(); sw[,mean(duration>60),group]; toc()
...
0.323 sec elapsed
tic(); dt.fun('duration>60'); toc()
...
3.205 sec elapsed
Pre-parsing the expression makes no difference:
dt.fun <- function(expr) {
  parsed = parse(text = expr)
  sw[, .("mean_bin" = mean(eval(parsed), na.rm = T)), group]
}
tic(); dt.fun('duration>60'); toc()
...
3.832 sec elapsed
Pre-defining the function makes up all or most of the difference:
f1=eval(parse(text='function(duration)duration>60'))
dt.fun1 <- function(f) {
  sw[, .("mean_bin" = mean(f(duration), na.rm = T)), group]
}
tic(); dt.fun1(f1); toc()
...
0.341 sec elapsed
Pre-defining the function dynamically also works:
expr='duration>60'
f1=eval(parse(text=paste0('function(duration)',expr)))
tic(); dt.fun1(f1); toc()
...
0.339 sec elapsed
Using eval() once per expression, instead of inside the data.table j-expression, solves the problem, e.g.:
dt.fun <- function(expr) {
  f = eval(parse(text = paste0(
    'function(sw) sw[, .("mean_bin" = mean(', expr, ', na.rm = T)), group]'
  )))
  f(sw)
}
tic(); dt.fun('duration>60'); toc()
...
0.33 sec elapsed
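Translated back to the starwars setup from your question, the same pattern looks like this (a sketch using your sw and species names; I haven't timed it against your exact data):
dt.fun <- function(expr) {
  f = eval(parse(text = paste0(
    'function(sw) sw[, .("mean_bin" = mean(', expr, ', na.rm = TRUE)), by = species]'
  )))
  f(sw)
}

dt.fun("height > 150")
dt.fun("hair_color %in% c('brown', 'blond')")
dt.fun("mass")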
PS: If this is for a web application, you indeed shouldn't use eval(parse(...)) because of the risk of code injection. Instead, you could implement an allowed list of accepted operators and column names. For example, something like:
allowed_operators = c('+', '-', '*', '/', '^', '%%', '%/%',
                      '<', '>', '<=', '>=', '==', '!=', '!', '&', '&&', '|', '||')
cols = colnames(sw)
if (
  length(operator) == 1 && length(operands) == 2 &&
  operator %in% allowed_operators &&
  all(is.numeric(operands) | operands %in% cols)
) {
  env = list(.cond = call(operator, as.name(operands[1]), operands[2]))
}
...but of course this severely limits the set of handled expressions. (ref: jangorecki's comment below.)
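For completeness, here is one way that allow-list idea could be fleshed out. This is only an illustrative sketch of mine, not code from the comment: the helper name dt.fun.safe and its calling convention are hypothetical, it handles only a single binary expression on your sw/species data, and the env= substitution requires a reasonably recent data.table (see the programming-on-data.table vignette):
dt.fun.safe <- function(operator, lhs, rhs) {
  allowed_operators = c('+', '-', '*', '/', '^', '%%', '%/%',
                        '<', '>', '<=', '>=', '==', '!=', '&', '|')
  cols = colnames(sw)
  stopifnot(
    length(operator) == 1, operator %in% allowed_operators,
    lhs %in% cols,
    is.numeric(rhs) || rhs %in% cols
  )
  # Build the condition as an unevaluated call, e.g. height > 150
  rhs_expr = if (is.character(rhs) && rhs %in% cols) as.name(rhs) else rhs
  cond = call(operator, as.name(lhs), rhs_expr)
  # Substitute it into j via env= -- no eval(parse()) on user input
  sw[, .("mean_bin" = mean(.cond, na.rm = TRUE)), by = species,
     env = list(.cond = cond)]
}

dt.fun.safe(">", "height", 150)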