In R's data.table, one can chain multiple operations by placing square-bracket calls one after another. Each call can use non-standard evaluation of, e.g., column names for the current transformation in the chain. For example:
dt[,
.(agg1=mean(var1), agg2=mean(var2)),
by=.(col1, col2)
][,
.(agg2=ceiling(agg2), agg3=agg2^2)
]
Suppose I want to apply some function that takes a data.frame as input, and I want to do so inside a data.table chain, passing it the current object in the chain and using the results in, e.g., a by clause. In magrittr/dplyr chains, I can use . to refer to the object that is passed to a pipe and do arbitrary operations with it, but in data.table the chains work quite differently.
For example, suppose I have this table and I want to compute a percentage of total across groups:
library(data.table)
dt = data.table(col1 = c(1,1,1,2,2,3), col2=10:15)
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)]
   col1 agg1
1:    1 0.44
2:    2 0.36
3:    3 0.20
In this snippet, I referred to the table by its full name inside [ to compute the column total as sum(dt$col2), which bypasses the by grouping. But if this were part of a chain and the columns had been computed by earlier operations, I couldn't simply use dt like that, because it is an external variable (that is, it is evaluated in an environment outside the [ scope) and still holds the original table rather than the intermediate result.
How can I refer to the current object/table inside a chain? To be clear, I am not asking about how to do this specific grouping operation, but about a general way of accessing the current object so as to use it in arbitrary functions.
CodePudding user response:
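Within a chained [ call that has no by, .SD refers to the current table in the chain, so we could use .SD directly (optionally restricted with .SDcols) and select the column as a vector with [[: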
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)][,
.(agg1 = ceiling(.SD[["agg1"]]))]
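For the percent-of-total case from the question, the same idea might be sketched as follows. This is only an illustration under a few assumptions: the names s, n_rows, share, step1, res, and d are made up for the example, and the second part relies on the native pipe and lambda syntax, so it assumes R 4.1 or later. Note that once by is used, .SD holds only the rows of the current group, so the second part binds the whole intermediate table to a name with an anonymous function instead.

```r
library(data.table)

dt <- data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)

# Part 1: no `by` in the second link, so .SD is the whole intermediate table.
# It can be passed to any function that accepts a data.frame (here nrow),
# or indexed with [[ to get a column as a plain vector.
step1 <- dt[, .(s = sum(col2)), by = col1][
  , .(col1, n_rows = nrow(.SD), share = s / sum(.SD[["s"]]))
]

# Part 2: with `by`, .SD would be the per-group subset only.
# Wrapping the link in an anonymous function gives the intermediate table
# a name (d, hypothetical) that is visible inside the grouped j expression.
res <- dt[col2 > 10] |>
  (\(d) d[, .(agg1 = sum(col2) / sum(d$col2)), by = col1])()

print(step1)
print(res)
```

The first form stays entirely within data.table syntax but only sees the full table when the link has no by. The second form is more general, at the cost of stepping outside the [ ][ ] chaining style.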