Combining 100 RDS/RData files into one large file - large file too big


I have 100 .rds files, each approximately 2510 KB in size, and would like to bind them all together by row into one large data file.

So far I am using this:

library(data.table)

memory.limit(size = 1500000000)

# full.names = TRUE so readRDS() receives complete paths, not bare file names
files = list.files(path = "mypath", pattern = "\\.rds$", full.names = TRUE)

dat_list = lapply(files, function(x) data.table(readRDS(x)))

all <- do.call("rbind", dat_list)

This seems to work, but when running the final line I get a "cannot allocate vector of size..." error, which I understand means the combined object I am trying to create is too large to fit in memory.

As you can see, I have tried increasing the memory limit in R, but this does not help. Is there any way I can get around this? I have read about methods of combining csv files outside of R so that R's memory is not affected - is there a similar method that can be used here?

I intend to convert this to a file-mapped big.matrix object later, if that helps. I also have the same files in RData format.

Would appreciate any help anyone can offer!

CodePudding user response:

Update: use the newer purrr::map_df() function, which combines map() and bind_rows() and returns a data frame:

https://purrr.tidyverse.org/reference/map.html

library(tidyverse)
my_files = list.files(pattern = "\\.rds$") # pattern is a regex, so escape the dot and anchor the extension
my_all <- map_df(my_files, read_rds)
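
Note that map_df() still materializes the whole result in memory, so on its own it will not avoid the allocation error; it mainly collapses the lapply()/do.call() pair into a single call. If the files sit in another directory, pass full paths; a small variant of the same idea, assuming the files live under "mypath" as in the question:

library(tidyverse)

# full.names = TRUE returns "mypath/x.rds" rather than bare file names,
# so read_rds() can locate the files from any working directory
my_files <- list.files(path = "mypath", pattern = "\\.rds$", full.names = TRUE)
my_all <- map_df(my_files, read_rds)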

...

The dplyr::bind_rows() function is documented as an efficient implementation of the common pattern do.call(rbind, dfs) for binding many data frames into one.

https://dplyr.tidyverse.org/reference/bind.html

library(tidyverse)
write_rds(iris, "iris1.rds") #write three sample files
write_rds(iris, "iris2.rds")
write_rds(iris, "iris3.rds")
my_files = list.files(pattern = "\\.rds$")
dat_list = lapply(my_files, function(x) read_rds(x)) # switched to read_rds() only
my_all <- do.call("bind_rows", dat_list) # switched to bind_rows()
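
Since the question already wraps each file in data.table(), data.table::rbindlist() is another drop-in replacement for do.call("rbind", ...): it binds a list of tables in one C-level pass with the result allocated up front. A minimal sketch, assuming the files live under "mypath" as in the question:

library(data.table)

# rbindlist() is implemented in C and allocates the full combined table
# once, which keeps peak memory closer to a single copy of the data
files <- list.files(path = "mypath", pattern = "\\.rds$", full.names = TRUE)
dat_list <- lapply(files, readRDS)
all_dt <- rbindlist(dat_list)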

CodePudding user response:

As a follow-up to my comment:

You could save the Rds files sequentially into tsv or csv file(s) (e.g. with data.table::fwrite; that would work for simple data structures, at least): either build one big file by appending each file's rows and then removing them from memory (fwrite with append = TRUE), or save them one by one and concatenate them on the command line. If the resulting text file is still too large to fit into memory, you could then load it in chunks, use vroom, etc., to get the data back into R.

Below is an example showing the idea:

library(data.table)
# generate 100 Rds files as examples
invisible(lapply(1:100, 
    \(x) saveRDS(data.frame(matrix(rnorm(1e5), ncol = 5, nrow = 2e4, 
        dimnames = list(NULL, paste0("col", 1:5)))), sprintf("file%d.Rds", x))))

# files to concatenate
files <- list.files(pattern = "file.*\\.Rds")

# assuming all files share the same column names, take them from the first file
cn <- colnames(readRDS(file = files[1]))
fwrite(data.table(t(cn)), file = "outfile.csv", col.names = FALSE)
# sequentially load each Rds file and append its rows to the output file
invisible(lapply(files, 
    \(x) {fwrite(readRDS(x), file = "outfile.csv", col.names = FALSE, append = TRUE)}))

# open with vroom
library(vroom)
vroom("outfile.csv")
#> Rows: 2000000 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): col1, col2, col3, col4, col5
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2,000,000 × 5
#>      col1     col2   col3   col4   col5
#>     <dbl>    <dbl>  <dbl>  <dbl>  <dbl>
#>  1 -1.08  -0.366    1.26  -0.791  1.37 
#>  2  0.365  0.382   -0.742 -0.648 -0.800
#>  3  1.09  -0.618    0.480  1.64   0.155
#>  4  2.54   0.170   -0.654  0.537  0.140
#>  5 -0.331  0.262    0.156  0.360 -0.250
#>  6 -0.349 -0.00872  0.322  0.698  0.653
#>  7  0.353 -0.0634   1.28  -0.402 -1.54 
#>  8  1.35   1.15    -1.05   0.410 -0.183
#>  9 -0.499 -3.07     1.14  -0.878  1.11 
#> 10  0.479  1.30     0.718  1.17  -1.02 
#> # … with 1,999,990 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-07-23 by the reprex package (v2.0.1)
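
Two notes on the last steps. vroom() reads lazily (using ALTREP), indexing the file and materializing values only when columns are actually used, so the final read does not pull the whole table into memory at once. And since the question plans to convert the result to a file-mapped big.matrix later: once the data sits in a single all-numeric CSV, it can be read directly into a file-backed matrix with bigmemory::read.big.matrix(). A minimal sketch of that step, assuming the "outfile.csv" created above:

library(bigmemory)

# stream the csv into a file-backed big.matrix; the backing and descriptor
# files let a later session attach.big.matrix() without re-reading the csv
big <- read.big.matrix("outfile.csv", header = TRUE, type = "double",
                       backingfile = "outfile.bin",
                       descriptorfile = "outfile.desc")
dim(big)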
