I have a tibble/data frame in R with about 206 million records and 5 columns. My system runs out of memory if I do any further analysis or computation on this data, so I want to write this tibble to disk as 4 separate CSV files of ~50 million records each (the last one would be ~56 million) and proceed with the computation/analysis in 4 separate iterations. I searched a few threads on the web but could not find anything that fits this use case.
How can I achieve this?
CodePudding user response:
Let us know if your machine has the memory for the approach below. It achieves OP's goal of splitting the original df into 4 batches and saving each batch to a separate file.
library(data.table)
# dummy data (with your real tibble, call setDT(df) to convert it in place)
df <- data.table(row_id = 1:123)
setDT(df)
# parameters
x <- nrow(df)  # number of rows
y <- 4         # number of splits
# create batch number (ceiling() keeps batch sizes as even as possible)
df[, batch := rep(1:y, each = ceiling(x / y), length.out = x)]
# split into a list of data.tables, one per batch
df <- split(df, by = 'batch')
# save each batch as a separate csv, named after its batch number
lapply(df, \(i) fwrite(i, file = paste0(i$batch[1], '.csv')))
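If even split() is too heavy (it briefly holds the list of batches alongside df), a minimal alternative sketch is to write each batch directly from a row-index range, so nothing beyond df and a small index vector is kept in memory. The file names here are illustrative:
library(data.table)
df <- data.table(row_id = 1:123)  # stand-in for the real 206M-row table
n <- nrow(df)
y <- 4                  # number of files
size <- ceiling(n / y)  # rows per file
for (b in seq_len(y)) {
  rows <- ((b - 1) * size + 1):min(b * size, n)
  fwrite(df[rows], file = paste0("chunk_", b, ".csv"))
}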
CodePudding user response:
Apologies if this solution misses the mark, but I believe the below should work:
library(dplyr)
library(readr)
df %>%                     # name of dataframe
  slice(1:5e7) %>%         # first 50M rows
  write_csv("file_a.csv")  # save as csv
and repeat for the remaining sets, changing the row range in slice() and the file name in write_csv().
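The repetition can be scripted; a sketch under the same assumptions (dplyr and readr loaded, ~50M rows per file with the remainder in the last one; the file names are illustrative):
library(dplyr)
library(readr)
n <- nrow(df)
size <- 5e7
starts <- seq(1, n, by = size)
for (k in seq_along(starts)) {
  df %>%
    slice(starts[k]:min(starts[k] + size - 1, n)) %>%
    write_csv(paste0("file_", letters[k], ".csv"))
}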