Partition a large list into chunks with convenient I/O


I have a large list of approx. 1.3 GB. I'm looking for the fastest solution in R to split it into chunks and save them in any convenient format so that:

a) every saved chunk file is smaller than 100 MB

b) the original list can be loaded conveniently and quickly into a new R workspace

EDIT II: The reason for doing this is to bypass GitHub's file size restriction of 100 MB per file with a pure R solution. The limitation to R is due to external non-technical restrictions which I can't comment on.

What is the best solution for this problem?

EDIT I: Since it was mentioned in the comments that some example code would make for a better question:

An R example of a list of roughly 1.3 GB:

li <- list(a = rnorm(10^8),
           b =  rnorm(10^7.8))

Answer:

So, you want to split a file and reload it into a single dataframe.

There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
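For instance (an illustrative check only, not part of the solution below; the file names are placeholders), you can compare the on-disk size of the same data with and without compression via file.size():

x <- data.frame(v = rnorm(10^6))            # illustrative data
saveRDS(x, "x_xz.rds", compress = "xz")     # xz-compressed RDS
saveRDS(x, "x_none.rds", compress = FALSE)  # uncompressed RDS
file.size("x_xz.rds") / 1024^2              # size on disk in MB
file.size("x_none.rds") / 1024^2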

The following is a piece of code I have used for a similar task (unrelated to GitHub though).

The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc. The function returns the number of files written.

The join.files function takes the base name and reassembles the chunks into a single dataframe.

Note:

  • Play with the rows parameter to find out the correct size to fit in 100 MB. It depends on your data, but for similar datasets a fixed size should do (see the calibration sketch after this list). However, if you are dealing with very different datasets, this approach will likely fail.
  • When reading, you need roughly twice as much memory as is occupied by your dataframe (because a list of the smaller dataframes is read first, then combined with rbind).
  • The number is written as 5 digits, but you can change that. The goal is to have the names in lexicographic order, so that when the files are concatenated, the rows are in the same order as the original file.
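A minimal calibration sketch, assuming a probe-and-extrapolate approach; estimate.rows, probe_rows, and target_mb are hypothetical names, not part of the functions below:

# Write a small probe chunk, measure its compressed size, and extrapolate
# how many rows should fit under the target size.
estimate.rows <- function(db, probe_rows = 10000, target_mb = 100) {
  probe <- db[seq_len(min(probe_rows, nrow(db))), , drop = FALSE]
  tmp <- tempfile(fileext = ".rds")
  saveRDS(probe, tmp, compress = "xz")
  mb_per_row <- file.size(tmp) / 1024^2 / nrow(probe)
  unlink(tmp)
  # Leave some headroom, since compression ratios vary across chunks.
  floor(0.9 * target_mb / mb_per_row)
}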

Here are the functions:

split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows                                  # number of full chunks
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {                              # write the remaining rows, if any
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m                                                # number of files written
}

join.files <- function(basename) {
  # Lexicographic sort keeps the chunks in their original order.
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename)))
  do.call("rbind", lapply(files, readRDS))
}

Example:

n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)
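If you want to double-check the GitHub limit, an optional sanity check on the written chunks could look like this:

files <- list.files(pattern = "myfile[0-9]{5}\\.rds")
all(file.size(files) < 100 * 1024^2)   # TRUE if every chunk is under 100 MB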