I have a data table of 20M rows and 20 columns, to which I apply vectorized operations that return lists, themselves assigned by reference to additional columns in the data table.
The memory usage increases predictably and modestly throughout those operations, until I apply the (presumably highly efficient) frollmean() function to a column that contains lists of length 10, using an adaptive window. Running even the much smaller RepRex in R 4.1.2 on Windows 10 x64, with package data.table 1.14.2, the memory usage spikes by ~17 GB while frollmean() executes, before coming back down, as seen in Windows' Task Manager (Performance tab) and measured in the Rprof memory-profiling report.
I understand that frollmean() uses parallelism where possible, so I set setDTthreads(threads = 1L) to make sure the memory spike is not an artifact of copying the data table for additional cores.
My question: why does frollmean() use so much memory relative to the other operations, and can I avoid that?
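For context, a minimal standalone call with the same shape as the operation in question (a list of numeric vectors sharing one adaptive window; the values here are illustrative only):

```r
library(data.table)

x <- list(as.numeric(1:5), as.numeric(6:10))  # stand-in for one list column
n <- c(1, 2, 2, 2, 2)                         # adaptive window widths
res <- frollmean(x = x, n = n, adaptive = TRUE)
str(res)  # one result vector per list element
```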
RepRex
library(data.table)
set.seed(1)
setDTthreads(threads = 1L)
obs <- 10^3 # Number of rows in the data table
len <- 10 # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window
# Generate representative data
DT <- data.table(
V1 = sample(x = 1:10, size = obs, replace = TRUE),
V2 = sample(x = 11:20, size = obs, replace = TRUE),
V3 = sample(x = 21:30, size = obs, replace = TRUE)
)
# Apply representative vectorized operations, assigning by reference
DT[, V4 := Map(seq, from = V1, to = V2, length.out = len)] # This is a list
DT[, V5 := Map("*", V4, V3)] # This is a list
DT[, V6 := Map("*", V4, V5)] # This is a list
# Profile the memory usage
Rprof(memory.profiling = TRUE)
# Rolling mean
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
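As a cross-check independent of Rprof, base R's gc() tracks the peak allocation since its counters were last reset; this sketch substitutes a throwaway allocation for the frollmean() call:

```r
gc(reset = TRUE)                  # reset the "max used" counters
x <- numeric(10^7)                # stand-in for the memory-hungry operation
rm(x)
peak <- gc()                      # "max used" columns show the peak since reset
peak[, "max used"]
```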
CodePudding user response:
Consider avoiding embedded lists inside columns. Recall that the data.frame and data.table classes are extensions of the list type, where typeof(DT) returns "list". Hence, instead of running frollmean on nested lists, consider running it across flat atomic vector columns:
obs <- 10^3 # Number of rows in the data table
len <- 10 # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window
# CALCULATE SEQ VECTOR (USING mapply, THE PARENT OF ITS WRAPPER Map)
set.seed(1)
V1 = sample(x = 1:10, size = obs, replace = TRUE)
V2 = sample(x = 11:20, size = obs, replace = TRUE)
V3 = sample(x = 21:30, size = obs, replace = TRUE)
seq_vec <- as.vector(mapply(seq, from = V1, to = V2, length.out = len))
# BUILD DATA.TABLE USING SEQ VECTOR FOR FLAT ATOMIC VECTOR COLUMNS
DT_ <- data.table(
WIDTH = rep(width, obs),
V1 = rep(V1, each=len),
V2 = rep(V2, each=len),
V3 = rep(V3, each=len),
V4 = seq_vec
)[, V5 := V4*V3][, V6 := V4*V5]
DT_
WIDTH V1 V2 V3 V4 V5 V6
1: 1 9 20 29 9.00000 261.0000 2349.000
2: 2 9 20 29 10.22222 296.4444 3030.321
3: 2 9 20 29 11.44444 331.8889 3798.284
4: 2 9 20 29 12.66667 367.3333 4652.889
5: 2 9 20 29 13.88889 402.7778 5594.136
---
9996: 2 5 16 26 11.11111 288.8889 3209.877
9997: 2 5 16 26 12.33333 320.6667 3954.889
9998: 2 5 16 26 13.55556 352.4444 4777.580
9999: 2 5 16 26 14.77778 384.2222 5677.951
10000: 2 5 16 26 16.00000 416.0000 6656.000
Then calculate frollmean, grouping by V1 and V2:
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by=.(V1, V2)]
The output should be equivalent to the nested list-column values:
identical(DT$V4[[1]], DT_$V4[1:len])
[1] TRUE
identical(DT$V5[[1]], DT_$V5[1:len])
[1] TRUE
identical(DT$V6[[1]], DT_$V6[1:len])
[1] TRUE
identical(DT$V7[[1]], DT_$V7[1:len])
[1] TRUE
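If downstream code still needs the nested list-column layout, the flat result can be collapsed back into one list element per original row. This is only a sketch: the rid identifier below is hypothetical (it could be added as rep(seq_len(obs), each = len) when building DT_), and the toy data stands in for the real columns:

```r
library(data.table)

obs <- 3; len <- 4
flat <- data.table(
  rid = rep(seq_len(obs), each = len),   # hypothetical per-row identifier
  V7  = as.numeric(seq_len(obs * len))   # toy stand-in for the computed column
)
# Collapse the flat column back into one list element per original row
nested <- flat[, .(V7 = list(V7)), by = rid]
```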
In doing so, profiling shows fewer steps and less memory used than in the nested-list approach. The runs below use obs <- 10^5.
frollmean on the nested list column using DT:
# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
$by.self
self.time self.pct total.time total.pct mem.total
"froll" 1.30 76.47 1.30 76.47 1584.6
"FUN" 0.14 8.24 0.30 17.65 161.3
"eval" 0.12 7.06 1.46 85.88 1670.9
"vapply" 0.10 5.88 0.40 23.53 181.3
"parent.frame" 0.04 2.35 0.04 2.35 24.8
$by.total
total.time total.pct mem.total self.time self.pct
"[.data.table" 1.70 100.00 1765.9 0.00 0.00
"[" 1.70 100.00 1765.9 0.00 0.00
"eval" 1.46 85.88 1670.9 0.12 7.06
"froll" 1.30 76.47 1584.6 1.30 76.47
"frollmean" 1.30 76.47 1584.6 0.00 0.00
"vapply" 0.40 23.53 181.3 0.10 5.88
"%chin%" 0.40 23.53 181.3 0.00 0.00
"vapply_1c" 0.40 23.53 181.3 0.00 0.00
"which" 0.40 23.53 181.3 0.00 0.00
"FUN" 0.30 17.65 161.3 0.14 8.24
"parent.frame" 0.04 2.35 24.8 0.04 2.35
$sample.interval
[1] 0.02
$sampling.time
[1] 1.7
frollmean on the atomic vector column, by group, using DT_:
# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by=.(V1, V2)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
$by.self
self.time self.pct total.time total.pct mem.total
"[.data.table" 0.02 33.33 0.06 100.00 18.7
"forderv" 0.02 33.33 0.02 33.33 0.0
"froll" 0.02 33.33 0.02 33.33 10.6
$by.total
total.time total.pct mem.total self.time self.pct
"[.data.table" 0.06 100.00 18.7 0.02 33.33
"[" 0.06 100.00 18.7 0.00 0.00
"forderv" 0.02 33.33 0.0 0.02 33.33
"froll" 0.02 33.33 10.6 0.02 33.33
"frollmean" 0.02 33.33 10.6 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 0.06
(Interestingly, on my Linux laptop with 8 GB RAM, at 10^6 obs the list-column approach raised Error: cannot allocate vector of size 15.3 Gb, while the vector-column approach did not.)