What's the best way to use data.table inside a function in R?

I have a function that uses data.table syntax. The function works, but it is way too slow. I know there has to be a better way to do this. Here is my function, with some setup for reproducibility:

library(tidyverse)
library(data.table)
sw <- data.table(starwars)

# Large data set to make the difference noticeable
for (i in 1:15) {
  sw <- rbind(sw, sw)
}

# My working but slow function
dt.fun <- function(expr) {
  sw[, .("mean_bin" = mean(eval(parse(text = expr)), na.rm = T)),
     by = species]
}

The function takes an expression and calculates the mean of that expression within each species. It must be able to handle any expression a regular user could type at the console, i.e. it must be able to calculate all of the following:

dt.fun("height > 150")

dt.fun("hair_color %in% c('brown', 'blond')")

dt.fun("mass")

The problem is that my function takes too long and is clearly not leveraging data.table's power:

# Using dt.fun()
tictoc::tic()
dt.fun("height > 150")
tictoc::toc()
# 2.23 sec elapsed


# Inputting the expression myself
tictoc::tic()
sw[, .("mean_bin" = mean(height > 150, na.rm = T)),
   by = species]
tictoc::toc()
# 0.28 sec elapsed

My questions are:

  • How can I rewrite my function so that it's faster?
  • What other improvements would you make?

Thanks for your help!

CodePudding user response:

I can replicate your finding that dt.fun is roughly a tenth as fast as the hardcoded expression. I don't have your sw data, so I used my own sample data: 1.6 million rows, with data.table using 32 cores.

tic(); sw[,mean(duration>60),group]; toc()
...
0.323 sec elapsed

tic(); dt.fun('duration>60'); toc()
...
3.205 sec elapsed

Pre-parsing the expression makes no difference:

dt.fun <- function(expr) {
    parsed=parse(text = expr)
    sw[, .("mean_bin" = mean(eval(parsed),na.rm = T)), group]
}
tic(); dt.fun('duration>60'); toc()
...
3.832 sec elapsed

Pre-defining the function makes up all or most of the difference:

f1=eval(parse(text='function(duration)duration>60'))
dt.fun1 <- function(f) {
    sw[, .("mean_bin" = mean(f(duration),na.rm = T)), group]
}
tic(); dt.fun1(f1); toc()
...
0.341 sec elapsed

Pre-defining the function dynamically also works:

expr='duration>60'
f1=eval(parse(text=paste0('function(duration)',expr)))
tic(); dt.fun1(f1); toc()
...
0.339 sec elapsed

Calling eval() once per expression, instead of inside the data.table j-expression, solves the problem, e.g.:

dt.fun <- function(expr) {
    f = eval(parse(text=paste0(
            'function(sw) sw[,.("mean_bin" = mean(',expr,',na.rm=T)),group]'
        )))
    f(sw)
}
tic(); dt.fun('duration>60'); toc()
...
0.33 sec elapsed
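
As a side note, newer versions of data.table support substituting language objects into j directly via an env argument (see the "programming on data.table" vignette), which avoids pasting code into a string. A minimal sketch against the questioner's sw/species data, assuming a data.table version that has env (the name dt.fun2 is mine):

dt.fun2 <- function(expr) {
    # str2lang() parses the string into an unevaluated call once;
    # data.table substitutes it for .expr in j before evaluating,
    # so j runs as an ordinary expression, like the hardcoded version
    sw[, .("mean_bin" = mean(.expr, na.rm = T)),
       by = species,
       env = list(.expr = str2lang(expr))]
}
dt.fun2('height > 150')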

PS: If this is for a web application, you indeed shouldn't use eval(parse(...)) because of the risk of code injection. Instead, you could implement an allow-list of accepted operators and column names. For example, something like:

allowed_operators = c('+', '-', '*', '/', '^', '%%', '%/%', '<', '>', '<=', '>=', '==', '!=', '!', '&', '&&', '|', '||')
cols = colnames(sw)
# (operator and operands here would come from parsing the user's input)
if(
  length(operator)==1 && length(operands)==2 &&
  operator %in% allowed_operators &&
  all(is.numeric(operands) | operands %in% cols)
){
    env=list(.cond=call(operator, as.name(operands[1]), operands[2]))
}

...but of course this severely limits the set of handled expressions. (ref: jangorecki's comment below.)
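
For completeness, here is a hypothetical self-contained version of that allow-list idea, combining the validation above with the env substitution. The name safe.fun and the bare-column/single-call restriction are illustrative choices, not a vetted security mechanism:

safe.fun <- function(expr_string) {
    allowed_operators = c('+', '-', '*', '/', '^', '%%', '%/%',
                          '<', '>', '<=', '>=', '==', '!=',
                          '!', '&', '&&', '|', '||')
    e = str2lang(expr_string)
    # accept a bare column name, or a single call whose operator is
    # allowed and whose arguments are column names or numeric literals
    ok = if (is.name(e)) {
        as.character(e) %in% colnames(sw)
    } else if (is.call(e) && is.name(e[[1]])) {
        args = as.list(e)[-1]
        as.character(e[[1]]) %in% allowed_operators &&
            all(vapply(args, function(a)
                  is.numeric(a) ||
                  (is.name(a) && as.character(a) %in% colnames(sw)),
                logical(1)))
    } else FALSE
    if (!ok) stop('expression not allowed: ', expr_string)
    # the validated call is injected via env, never re-parsed in j
    sw[, .("mean_bin" = mean(.cond, na.rm = T)),
       by = species, env = list(.cond = e)]
}

safe.fun('height > 150')     # allowed
# safe.fun('system("ls")')   # rejected with an error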
