I am trying to read a large data set in chunks: find the column means of each chunk, add those means to a matrix, then take the mean of the means to get the overall mean of each column. I have the setup, but my while loop is not repeating its cycle. I think the problem may be in how I am referring to "chunks" and "chunk".
This is a practice exercise using "iris.csv" in R:
fl <- file("iris.csv", "r")
clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname,split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r)) # need a matrix that can be filled on each iteration
numchunk = 0 # counter for the chunks read so far
while(numchunk <= 0){ # stop when no more chunks left to run
  numchunk <- numchunk + 1 # keep on moving through chunks
  x <- readLines(fl, n=100) # read 100 lines at a time
  chunk <- as.numeric(unlist(strsplit(x,split = ","))) # readable chunk
  m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
  cm[numchunk,] <- colMeans(m) # get the column means of the matrix and fill in larger matrix
  print(numchunk) # print the number of chunks used
}
cm
close(fl)
final_mean <- colSums(cm)/nrow(cm)
return(final_mean)
This works when I set n = 1000, but I want it to work for larger data sets, where the while loop needs to keep running. Can anyone help me correct this, please?
CodePudding user response:
Perhaps, this helps
fl <- file("iris.csv", "r") # open the connection, as in the question
clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname,split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r)) # one row per chunk
numchunk = 0
flag <- TRUE
while(flag){
  numchunk <- numchunk + 1 # keep on moving through chunks
  x <- readLines(fl, n=5)
  print(length(x))
  if(length(x) == 0) {
    flag <- FALSE # nothing left to read, so stop looping
  } else {
    chunk <- as.numeric(unlist(strsplit(x,split = ","))) # readable chunk
    m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
    cm[numchunk,] <- colMeans(m) # get the column means of the matrix and fill in larger matrix
    print(numchunk) # print the number of chunks used
  }
}
cm
close(fl)
final_mean <- colMeans(cm, na.rm = TRUE) # average only the filled rows; unused rows of cm are still NA
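One caveat with averaging per-chunk means: the result equals the overall column mean only when every chunk has the same number of rows. A minimal sketch that sidesteps this by accumulating column sums and a row count is shown below. It is an illustration under assumptions, not the answer's method: the temporary file, the chunk size of 40, and the names `col_sums`, `n_rows`, and `overall_mean` are all made up for the example.

```r
# Write a small all-numeric CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
write.csv(iris[-5], tmp, row.names = FALSE)

con <- file(tmp, "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line gives the column count
col_sums <- numeric(length(header))                  # running column totals
n_rows <- 0                                          # running row count

repeat {
  x <- readLines(con, n = 40)   # 150 rows -> chunks of 40, 40, 40 and 30
  if (length(x) == 0) break     # end of file reached
  m <- matrix(as.numeric(unlist(strsplit(x, ","))),
              ncol = length(header), byrow = TRUE)
  col_sums <- col_sums + colSums(m)
  n_rows <- n_rows + nrow(m)
}
close(con)

overall_mean <- col_sums / n_rows  # exact column means, even with a short last chunk
```

Because only sums and a count are kept, there is no need to preallocate a matrix with a guessed number of rows.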
CodePudding user response:
First, it might be helpful to define a helper function r2v() to split raw lines into useful vectors.
r2v <- Vectorize(\(x) {
  ## splits raw lines to vectors
  strsplit(gsub('\\"', '', x), split=",")[[1]][-1]
})
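To see what r2v() returns, here is a small illustration on two made-up lines in the format that write.csv() produces. The sample lines and the names `lines`, `m`, and `num` are assumptions for the example. Each input line becomes one column of the result, with the leading row-name field dropped.

```r
# Helper from the answer: strip quotes, split on commas, drop the first field.
r2v <- Vectorize(\(x) {
  strsplit(gsub('\\"', '', x), split = ",")[[1]][-1]
})

# Two hypothetical raw lines: a quoted row name followed by four numbers.
lines <- c('"1",5.1,3.5,1.4,0.2', '"2",4.9,3,1.4,0.2')

m <- r2v(lines)   # character matrix: one column per input line
dim(m)            # 4 rows (fields) by 2 columns (lines)

num <- type.convert(m, as.is = TRUE)  # convert the strings to numbers
rowMeans(num)                         # mean of each field across the two lines
```

Since the fields run down the rows, rowMeans() here plays the role that colMeans() would play on the original table.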
After opening the file, check its size without reading it in, using system() and shell commands (wc is a Unix tool, so Windows needs a different command).
## open file
f <- 'iris.csv'
fl <- file(f, "r")
## rows
(nr <-
  as.integer(gsub(paste0('\\s', f), '', system(paste('wc -l', f), intern=TRUE))) - 1) ## minus 1 for the header line
# nr <- 150 ## alternatively define nrows manually
# [1] 150
## columns
nm <- readLines(fl, n=1) |> r2v()
(nc <- length(nm))
# [1] 4
Next, define a chunk size that divides the number of rows evenly.
## define chunk size
ch_sz <- 50
stopifnot(nr %% ch_sz == 0) ## all chunks should be filled
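If it is not obvious which chunk sizes pass that check, one option is to list the divisors of the row count and pick from them. This is only a sketch: the cap of 60 rows per chunk and the name `divisors` are assumptions for the example.

```r
nr <- 150  # number of data rows, as counted above

# Every chunk size that splits nr rows into equally filled chunks.
divisors <- which(nr %% seq_len(nr) == 0)
divisors

# Pick the largest divisor not exceeding an (arbitrary) cap of 60 rows.
ch_sz <- max(divisors[divisors <= 60])
stopifnot(nr %% ch_sz == 0)
```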
Then, using replicate(), we calculate chunk-wise rowMeans() (because we get the chunks transposed), and finally rowMeans() again on everything to get the column means of the entire matrix.
## calculate means chunk-wise
final_mean <-
  replicate(nr / ch_sz,
            rowMeans(type.convert(r2v(readLines(fl, n=ch_sz)), as.is=TRUE))) |>
  rowMeans()
close(fl)
Let's validate the result.
## test
all.equal(final_mean, as.numeric(colMeans(iris[-5])))
# [1] TRUE
Data:
iris[-5] |>
write.csv('iris.csv')