R data.table: how to refer to current object in a chain?

Time:09-16

In R's data.table, one can chain multiple operations by placing square brackets one after another. Each bracket can use non-standard evaluation of, e.g., column names from the table produced by the previous step in the chain. For example:

dt[,
    .(agg1=mean(var1), agg2=mean(var2)),
    by=.(col1, col2)
][,
    .(agg2=ceiling(agg2), agg3=agg2^2)
]

Suppose I want to compute some function that takes a data.frame as input, apply it to the current object in a data.table chain, and use the result in, e.g., a by clause. In magrittr/dplyr chains, I can use . to refer to the object passed through the pipe and do arbitrary operations with it, but data.table chains work quite differently.

For example, suppose I have this table and I want to compute a percentage of total across groups:

library(data.table)
dt = data.table(col1 = c(1,1,1,2,2,3), col2=10:15)
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)]
   col1 agg1
1:    1 0.44
2:    2 0.36
3:    3 0.20

In this snippet, I referred to the full object name inside [ to compute the total of the column as sum(dt$col2), which bypasses the by part. But if it were in a chain and these columns were calculated through other operations, I wouldn't have been able to simply use dt like that as that's an external variable (that is, it's evaluated in an environment level outside the [ scope).

How can I refer to the current object/table inside a chain? To be clear, I am not asking about how to do this specific grouping operation, but about a general way of accessing the current object so as to use it in arbitrary functions.

Answer:

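Inside j, .SD refers to the current table in the chain (the whole table when there is no by; only the current group's rows when there is one), optionally restricted to certain columns with .SDcols. A column can then be extracted from it as a vector with [[: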

dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)][, 
       .(agg1 = ceiling(.SD[["agg1"]]))]
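For the percentage-of-total case from the question, here is a minimal sketch of how .SD could stand in for the "current object" without naming dt in the later step. It assumes the total is computed in a chain step with no by, so that .SD there is the whole intermediate table; total_of is a hypothetical helper standing in for any function that accepts a data.frame, and the val column is made up for illustration.

```r
library(data.table)

dt <- data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)

# Hypothetical helper: stands in for "any function that takes a data.frame".
total_of <- function(df, col) sum(df[[col]])

# Step 1 builds an intermediate table with a derived column (val).
# Step 2 has no `by`, so .SD is the whole intermediate table: compute the
# total from it first, then do the grouped aggregation on .SD itself.
res <- dt[, .(col1, val = col2 * 2)][, {
  tot <- total_of(.SD, "val")
  .SD[, .(agg1 = sum(val) / tot), by = col1]
}]

print(res)
```

The grouping is done inside the braces rather than with by on the outer bracket because, with by, .SD would hold only the current group's rows and the total would be per group.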