I have a data table of 20M rows and 20 columns, to which I apply vectorized operations that return lists, themselves assigned by reference to additional columns in the data table.
The memory usage increases predictably and modestly throughout those operations, until I apply the (presumably highly efficient) frollmean() function to a column that contains lists of length 10, using an adaptive window. Running even the much smaller RepRex in R 4.1.2 on Windows 10 x64, with package data.table 1.14.2, the memory usage spikes by ~17 GB while frollmean() executes, before coming back down, as seen in Windows' Task Manager (Performance tab) and measured in the Rprof memory-profiling report.
I understand that frollmean() uses parallelism where possible, so I set setDTthreads(threads = 1L) to make sure the memory spike is not an artifact of copying the data table for additional cores.
My question: why does frollmean() use so much memory relative to the other operations, and can I avoid that?
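For context, a minimal standalone call with the same shape as the operation in question (a list of numeric vectors sharing one adaptive window; the values here are illustrative only):

```r
library(data.table)

x <- list(as.numeric(1:5), as.numeric(6:10))  # stand-in for one list column
n <- c(1, 2, 2, 2, 2)                         # adaptive window widths
res <- frollmean(x = x, n = n, adaptive = TRUE)
str(res)  # one result vector per list element
```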
RepRex
library(data.table)
set.seed(1)
setDTthreads(threads = 1L)
obs <- 10^3 # Number of rows in the data table
len <- 10 # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window
# Generate representative data
DT <- data.table(
V1 = sample(x = 1:10, size = obs, replace = TRUE),
V2 = sample(x = 11:20, size = obs, replace = TRUE),
V3 = sample(x = 21:30, size = obs, replace = TRUE)
)
# Apply representative vectorized operations, assigning by reference
DT[, V4 := Map(seq, from = V1, to = V2, length.out = len)] # This is a list
DT[, V5 := Map("*", V4, V3)] # This is a list
DT[, V6 := Map("*", V4, V5)] # This is a list
# Profile the memory usage
Rprof(memory.profiling = TRUE)
# Rolling mean
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
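As a cross-check independent of Rprof, base R's gc() tracks the peak allocation since its counters were last reset; this sketch substitutes a throwaway allocation for the frollmean() call:

```r
gc(reset = TRUE)                  # reset the "max used" counters
x <- numeric(10^7)                # stand-in for the memory-hungry operation
rm(x)
peak <- gc()                      # "max used" columns show the peak since reset
peak[, "max used"]
```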
CodePudding user response:
Consider avoiding embedded lists inside columns. Recall that the data.frame and data.table classes are extensions of the list type, where typeof(DT) returns "list". Hence, instead of running frollmean on nested lists, consider running it across flat atomic vector columns:
obs <- 10^3 # Number of rows in the data table
len <- 10 # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window
# CALCULATE SEQ VECTOR (USING mapply, THE PARENT OF ITS WRAPPER Map)
set.seed(1)
V1 = sample(x = 1:10, size = obs, replace = TRUE)
V2 = sample(x = 11:20, size = obs, replace = TRUE)
V3 = sample(x = 21:30, size = obs, replace = TRUE)
seq_vec <- as.vector(mapply(seq, from = V1, to = V2, length.out = len))
# BUILD DATA.TABLE USING SEQ VECTOR FOR FLAT ATOMIC VECTOR COLUMNS
DT_ <- data.table(
WIDTH = rep(width, obs),
V1 = rep(V1, each=len),
V2 = rep(V2, each=len),
V3 = rep(V3, each=len),
V4 = seq_vec
)[, V5 := V4*V3][, V6 := V4*V5]
DT_
WIDTH V1 V2 V3 V4 V5 V6
1: 1 9 20 29 9.00000 261.0000 2349.000
2: 2 9 20 29 10.22222 296.4444 3030.321
3: 2 9 20 29 11.44444 331.8889 3798.284
4: 2 9 20 29 12.66667 367.3333 4652.889
5: 2 9 20 29 13.88889 402.7778 5594.136
---
9996: 2 5 16 26 11.11111 288.8889 3209.877
9997: 2 5 16 26 12.33333 320.6667 3954.889
9998: 2 5 16 26 13.55556 352.4444 4777.580
9999: 2 5 16 26 14.77778 384.2222 5677.951
10000: 2 5 16 26 16.00000 416.0000 6656.000
Then calculate frollmean, grouping by V1 and V2:
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by=.(V1, V2)]
The output should be equivalent to the nested list-column values:
identical(DT$V4[[1]], DT_$V4[1:len])
[1] TRUE
identical(DT$V5[[1]], DT_$V5[1:len])
[1] TRUE
identical(DT$V6[[1]], DT_$V6[1:len])
[1] TRUE
identical(DT$V7[[1]], DT_$V7[1:len])
[1] TRUE
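If downstream code still needs the nested list-column layout, the flat result can be collapsed back into one list element per original row. This is only a sketch: the rid identifier below is hypothetical (it could be added as rep(seq_len(obs), each = len) when building DT_), and the toy data stands in for the real columns:

```r
library(data.table)

obs <- 3; len <- 4
flat <- data.table(
  rid = rep(seq_len(obs), each = len),   # hypothetical per-row identifier
  V7  = as.numeric(seq_len(obs * len))   # toy stand-in for the computed column
)
# Collapse the flat column back into one list element per original row
nested <- flat[, .(V7 = list(V7)), by = rid]
```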
In doing so, profiling shows fewer steps and less memory used than in the nested-list approach. The runs below use obs <- 10^5.
frollmean on the nested list column using DT:
# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
$by.self
self.time self.pct total.time total.pct mem.total
"froll" 1.30 76.47 1.30 76.47 1584.6
"FUN" 0.14 8.24 0.30 17.65 161.3
"eval" 0.12 7.06 1.46 85.88 1670.9
"vapply" 0.10 5.88 0.40 23.53 181.3
"parent.frame" 0.04 2.35 0.04 2.35 24.8
$by.total
total.time total.pct mem.total self.time self.pct
"[.data.table" 1.70 100.00 1765.9 0.00 0.00
"[" 1.70 100.00 1765.9 0.00 0.00
"eval" 1.46 85.88 1670.9 0.12 7.06
"froll" 1.30 76.47 1584.6 1.30 76.47
"frollmean" 1.30 76.47 1584.6 0.00 0.00
"vapply" 0.40 23.53 181.3 0.10 5.88
"%chin%" 0.40 23.53 181.3 0.00 0.00
"vapply_1c" 0.40 23.53 181.3 0.00 0.00
"which" 0.40 23.53 181.3 0.00 0.00
"FUN" 0.30 17.65 161.3 0.14 8.24
"parent.frame" 0.04 2.35 24.8 0.04 2.35
$sample.interval
[1] 0.02
$sampling.time
[1] 1.7
frollmean on the atomic vector column, by group, using DT_:
# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by=.(V1, V2)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
$by.self
self.time self.pct total.time total.pct mem.total
"[.data.table" 0.02 33.33 0.06 100.00 18.7
"forderv" 0.02 33.33 0.02 33.33 0.0
"froll" 0.02 33.33 0.02 33.33 10.6
$by.total
total.time total.pct mem.total self.time self.pct
"[.data.table" 0.06 100.00 18.7 0.02 33.33
"[" 0.06 100.00 18.7 0.00 0.00
"forderv" 0.02 33.33 0.0 0.02 33.33
"froll" 0.02 33.33 10.6 0.02 33.33
"frollmean" 0.02 33.33 10.6 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 0.06
(Interestingly, on my Linux laptop with 8 GB RAM, at 10^6 obs the list-column approach raised Error: cannot allocate vector of size 15.3 Gb, while the vector-column approach did not.)