Home > Software design >  How to be more memory efficient than melt.data.table?
How to be more memory efficient than melt.data.table?

Time:11-01

For reshaping wide to long data, I've generally found that melt.data.table from the data.table package is the most memory efficient method. This has been compared favourably to the dplyr and base packages online in terms of memory. However, I've found myself unable to melt an object of about 11GB in size, returning the message:

Error: cannot allocate vector of size 10.5 Gb

My Windows computer has 32GB RAM, using 64-bit R. My melt command with a 5-row verison of the dataset is included below (which works). The actual data has nearly 200 million rows.

library(data.table)
test <- structure(list(time = structure(c(667150201.25, 667153801.25, 667157401.25, 667161001.25, 667164601.25), tzone = "UTC", class = c("POSIXct", "POSIXt")),
                       red = c(-2.25, -2.375, -2.5, -0.5, -1.625),
                       orange = c(1.625, 1.375, 1.625, 2.25, 2.5),
                       yellow = c(1.25, 0.5, 1.5, 1.5, 1.625),
                       green = c(2.875, 2.625, 2.5, 3.25, 3),
                       blue = c(4.75, 4.5, 4.75, 4.75, 5.125),
                       purple = c(0.125, -0.125, 0.5, 1.25, 1.375),
                       violet = c(3.125, 2.875, 3.125, 3.375, 3.375),
                       pink = c(3.75, 1.75, 1.5, 1, 0.5)),
                       row.names = c(NA, -5L), class = c("data.table","data.frame"))
melt(test, id.vars='time',
  measure.vars=c('red','orange','yellow','green','blue','purple','violet','pink'),
  variable.name='color', value.name="value")

Is there a more memory efficient method to convert the data form wide to long? The goal is to have a dataset with 3 columns: time, color, and the value.

CodePudding user response:

perhaps an approach like this could work? I have no idea how to measure actual mem-usage..

library(data.table)
test.split <- split.default(test[, -1], names(test)[-1], )
data.table::rbindlist(lapply(test.split, cbind, test$time), use.names = FALSE, id = "color")

CodePudding user response:

If memory is the issue then taking smaller steps should solve it:

mv <- c('red','orange','yellow','green','blue','purple','violet','pink')
OUT <- data.table(color = character(0L), value = numeric(0L))
for (m in mv) {
  OUT <- rbind(OUT, test[, .(color = m, value = get(m))])
  set(test, j = m, value = NULL) # Delete the data since it is not necessary anymore
}
OUT[, time := rep(test$time, .N/nrow(test))]
  • Related