Loading an 11 GB .csv file as a big.matrix object


I have an 11 GB .csv file that I ultimately need as a big.matrix object. From what I have read, I think I need to create a file-backed big.matrix object, but I cannot figure out how to do this.

The file is too large for me to load directly into R and manipulate from there as I have done with smaller datasets. How do I produce a big.matrix object from the .csv file?

CodePudding user response:

See if this helps. I am posting it as an answer because it contains too much code for a comment.

The strategy is to read the file in chunks of 10K rows, coerce each chunk to a sparse matrix, and then rbind those sub-matrices together.
It uses data.table::fread for speed and fpeek::peek_count_lines to count the number of lines in the data file; that function is also fast.

library(data.table)
library(Matrix)

flname <- "your_filename"
nlines <- fpeek::peek_count_lines(flname)  # total line count, including the header
ndata  <- nlines - 1L                      # number of data rows (assumes a header row)
chunk  <- 10 * 1024

passes <- ndata %/% chunk
remaining <- ndata %% chunk
skip <- 1L                                 # start reading after the header line

data_list <- vector("list", length = passes + (remaining > 0))
for(i in seq_len(passes)) {
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip, nrows = chunk)
  data_list[[i]] <- Matrix(as.matrix(tmp), sparse = TRUE)
  skip <- skip + chunk
}
if(remaining > 0) {
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip)
  data_list[[passes + 1L]] <- Matrix(as.matrix(tmp), sparse = TRUE)
}

sparse_mat <- do.call(rbind, data_list)
rm(data_list)
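
If you really need a file-backed big.matrix in the end rather than a sparse matrix, bigmemory::read.big.matrix can also read the CSV directly into one without loading it all into RAM. A minimal sketch, assuming the file is all numeric with a header row (the backing file names below are just placeholders):

library(bigmemory)

bm <- read.big.matrix(flname, sep = ",", header = TRUE, type = "double",
                      backingfile = "big_data.bin",
                      descriptorfile = "big_data.desc")

dim(bm)        # dimensions are available without pulling the data into memory
bm[1:5, 1:3]   # rows/columns are read from the backing file on demand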

Test data

With the following test data, everything went fine. I also tried it with a bigger matrix.

The path is optional.

path <- "~/Temp"
flname <- file.path(path, "big_example.csv")
a <- matrix(1:(25*1024), ncol = 1)
b <- matrix(rbinom(25*1024*10, size = 1, prob = 0.01), ncol = 10)
a <- cbind(a, b)
dim(a)
write.csv(a, flname, row.names = FALSE)
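
As a quick sanity check, you can run the chunked reader above with flname pointing at this test file and compare the result against the original matrix; a small sketch, assuming sparse_mat was built as shown earlier:

dim(sparse_mat)                  # should be 25600 x 11, matching dim(a)
all(as.matrix(sparse_mat) == a)  # TRUE if the chunks were reassembled correctly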