How do I fill in a matrix (by chunks) using a while loop?

I am trying to read a large data set in chunks: find the column means of each chunk, store them as a row of a matrix, then take the mean of those means to get the overall mean of each column. I have the setup, but my while loop is not repeating its cycle. I think the problem may be in how I refer to "chunks" and "chunk".

This is a practice exercise using "iris.csv" in R:

fl <- file("iris.csv", "r")
clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname,split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r)) # need a matrix that can be filled on each iteration
numchunk = 0 #set my chunks of code to build up
while(numchunk <= 0){ #stop when no more chunks left to run
  numchunk <- numchunk + 1 # keep on moving through chunks of code
  x <- readLines(fl, n=100) #read 100 lines at a time
  chunk <- as.numeric(unlist(strsplit(x,split = ","))) # readable chunk of code
  m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
  cm[numchunk,] <- colMeans(m) #get the column means of the matrix and fill in larger matrix
  print(numchunk) # print the number of chunks used
}
cm
close(fl)
final_mean <- colSums(cm)/nrow(cm)
return(final_mean)

This works when I set n = 1000, but I want it to work for larger data sets, where the while loop will need to keep running. Can anyone help me correct this, please?

CodePudding user response：

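Perhaps this helps: keep looping until readLines() returns no more lines (fl is the connection opened in the question).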

clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname,split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r)) # one row per chunk, up to 1000 chunks
numchunk <- 0
flag <- TRUE
while(flag){
  x <- readLines(fl, n=5) # read 5 lines at a time
  print(length(x))
  if(length(x) == 0) {
    flag <- FALSE # no lines left: stop the loop
  } else {
    numchunk <- numchunk + 1 # count only chunks that contain data
    chunk <- as.numeric(unlist(strsplit(x,split = ","))) # parse the chunk's fields as numbers
    m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
    cm[numchunk,] <- colMeans(m) # get the column means of the chunk and fill in the larger matrix
    print(numchunk) # print the number of chunks used
  }
}
cm
close(fl)
final_mean <- colMeans(cm, na.rm = TRUE) # average only the filled rows; unused rows of cm are still NA
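Note that averaging the chunk means matches the overall column mean only when every chunk has the same number of rows. A minimal sketch of an alternative that also handles a short final chunk is to accumulate column sums and a row count instead. The function name chunked_col_means, the temporary file, and the chunk size of 40 are assumptions made for illustration, and the file is assumed to hold only numeric, unquoted columns:

```r
# Hypothetical demo file: the four numeric iris columns, no row names, no quotes
tmp <- tempfile(fileext = ".csv")
write.csv(iris[-5], tmp, row.names = FALSE, quote = FALSE)

chunked_col_means <- function(path, chunk_size = 40) {
  con <- file(path, "r")
  on.exit(close(con))                  # always close the connection
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  sums <- numeric(length(header))      # running column sums
  n <- 0                               # running row count
  repeat {
    x <- readLines(con, n = chunk_size)
    if (length(x) == 0) break          # no lines left
    m <- matrix(as.numeric(unlist(strsplit(x, ","))),
                ncol = length(header), byrow = TRUE)
    sums <- sums + colSums(m)
    n <- n + nrow(m)
  }
  setNames(sums / n, header)
}

# 150 rows are not divisible by 40, so the last chunk is shorter
res <- chunked_col_means(tmp, chunk_size = 40)
all.equal(res, colMeans(iris[-5]))
```

Because only the sums and the count are kept, no result matrix has to be sized in advance.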

CodePudding user response：

First, it might be helpful to define a helper function r2v() that splits raw lines into vectors. It strips the quotes and drops the first field, which holds the row names written by write.csv().

r2v <- Vectorize(\(x) {
  ## splits raw lines to vectors
  strsplit(gsub('\\"', '', x), split=",")[[1]][-1]
  })
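To see what the helper returns, here is a small sketch on two made-up input lines shaped like rows of the CSV file. The helper is restated with function(x) instead of the \(x) shorthand so that the snippet also runs on R versions before 4.1; the sample lines are an assumption for illustration.

```r
# Same helper as above, written with function(x)
r2v <- Vectorize(function(x) {
  strsplit(gsub('\\"', '', x), split = ",")[[1]][-1]
})

# Two hypothetical raw lines: quoted row name first, then four values
lines <- c('"1",5.1,3.5,1.4,0.2', '"2",4.9,3,1.4,0.2')
out <- r2v(lines)

dim(out)   # 4 fields by 2 lines: each input line becomes a column
out[, 1]   # the fields of the first line, still as character strings
```

Each line ends up as a column, so the chunk arrives transposed relative to the file.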

After opening the file, check its size without reading it in, using system() and the shell command wc (which is not available by default on Windows).

## open file
f <- 'iris.csv'
fl <- file(f, "r")

## rows
(nr <- 
    as.integer(gsub(paste0('\\s', f), '', system(paste('wc -l', f), intern=TRUE))) - 1)
# nr <- 150  ## alternatively define nrows manually
# [1] 150

## columns
nm <- readLines(fl, n=1) |> r2v()
(nc <- length(nm))
# [1] 5
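If wc is not available, the row count can also be obtained in plain R by reading the file in blocks and keeping only the line count. This is a rough sketch rather than part of the approach above; count_rows, the block size, and the temporary demo file are assumptions for illustration.

```r
# Count data rows without keeping the file contents in memory
count_rows <- function(path, block = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    k <- length(readLines(con, n = block))  # lines read in this block
    if (k == 0L) break                      # end of file
    n <- n + k
  }
  n - 1L                                    # subtract the header line
}

# Hypothetical demo file written the same way as the data used here
tmp <- tempfile(fileext = ".csv")
write.csv(iris[-5], tmp)
nr_demo <- count_rows(tmp)
```

This passes over the file once more than wc would, but it holds at most one block of lines at a time.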

Next, define a chunk size that divides the number of rows evenly.

## define chunk size
ch_sz <- 50
stopifnot(nr %% ch_sz == 0)  ## all chunks should be filled

Then, using replicate(), we calculate chunk-wise rowMeans() (because we get the chunks transposed), and finally rowMeans() again on everything to get the column means of the entire matrix.

## calculate means chunk-wise
final_mean <-
  replicate(nr / ch_sz, 
            rowMeans(type.convert(r2v(readLines(fl, n=ch_sz)), as.is=TRUE))) |>
  rowMeans()
close(fl)

Let's validate the result.

## test
all.equal(final_mean, as.numeric(colMeans(iris[-5])))
# [1] TRUE

Data:

iris[-5] |>
  write.csv('iris.csv')