I am working with a data table that has a nested data table stored in a column:
fertName.x YDRange.x fertName.y YDRange.y dat
1: fertA 36-80 fertB 36-80 <data.table[7x3]>
2: fertA 36-80 fertC 36-80 <data.table[7x3]>
3: fertA 36-80 fertD 36-80 <data.table[7x3]>
where the dat
column of the first row looks like this:
FIELD_uniqueName OBS_numValue.x OBS_numValue.y
1: fieldA 5 3
2: fieldB 4 5
3: fieldC 7 5
4: fieldD 5 5
5: fieldE 4 5
6: fieldF 4 4
7: fieldG 5 7
For each row of the main data table, I need to do create a few new columns that will be based off calculations done on the values of the nested data table.
Three examples of such new columns would be:
n = .N, # count of rows of the nested data table,
vari = var(OBS_Value.x - OBS_Value.y), # variance of observed values
pvalue = t.test(OBS_Value.x - OBS_Value.y, conf.level = 0.90)$p.value # p-value from t test of observations
My real table has over 10 million rows, so the solution needs to be fast and with a relatively low memory footprint.
The expected result would be:
fertName.x YDRange.x fertName.y YDRange.y dat n vari pvalue
1: fertA 36-80 fertB 36-80 <data.table[7x3]> 7 2.333333 1
What is the best way to achieve this?
Steps to reproduce an example dataset:
library(data.table)
# main data table
dt <- "fertName.x YDRange.x fertName.y YDRange.y
fertA 36-80 fertB 36-80
"
dt <- setDT(read.table(textConnection(dt), sep = " ", header=T, stringsAsFactors=FALSE))
# nested data table
nest.dt <- "FIELD_uniqueName OBS_numValue.x OBS_numValue.y
fieldA 5 3
fieldB 4 5
fieldC 7 5
fieldD 5 5
fieldE 4 5
fieldF 4 4
fieldG 5 7
"
nest.dt <- setDT(read.table(textConnection(nest.dt), sep = " ", header=T, stringsAsFactors=FALSE))
dt$dat <- dt[, list(dat=list(nest.dt))]
CodePudding user response:
We could loop over the list
of data.table
with lapply
, then within the data.table
(x
), create the new columns (:=
) based on the OP's code
library(data.table)
dt[, dat := lapply(dat, function(x)
x[, c("n", "vari", "pvalue") := .(.N,
var(OBS_numValue.x - OBS_numValue.y),
t.test(OBS_numValue.x - OBS_numValue.y, conf.level = 0.90)$p.value)])]
-output
> dt
fertName.x YDRange.x fertName.y YDRange.y dat
1: fertA 36-80 fertB 36-80 <data.table[7x6]>
> dt$dat[[1]]
FIELD_uniqueName OBS_numValue.x OBS_numValue.y n vari pvalue
1: fieldA 5 3 7 2.333333 1
2: fieldB 4 5 7 2.333333 1
3: fieldC 7 5 7 2.333333 1
4: fieldD 5 5 7 2.333333 1
5: fieldE 4 5 7 2.333333 1
6: fieldF 4 4 7 2.333333 1
7: fieldG 5 7 7 2.333333 1
If it needs to be separate columns in the dt
dt[, c("n", "vari", "pvalue") := rbindlist(lapply(dat, function(x)
x[, .(.N, var(OBS_numValue.x - OBS_numValue.y), t.test(OBS_numValue.x - OBS_numValue.y, conf.level = 0.90)$p.value)]))]
-output
> dt
fertName.x YDRange.x fertName.y YDRange.y dat n vari pvalue
1: fertA 36-80 fertB 36-80 <data.table[7x3]> 7 2.333333 1