I have a tibble/data frame in R with about 206 million records and 5 columns. My system runs out of memory if I do any further analysis or computation on this data, so I want to write this tibble to disk as 4 separate CSV files of ~50 million records each (the last one would be ~56 million) and proceed with the computation/analysis in 4 separate iterations. I searched a few threads on the web but could not find anything that fits this use case.
How can I achieve this?
CodePudding user response:
Let us know if your machine has the memory for the approach below. It achieves OP's goal of splitting the original df into 4 batches and saving each batch to a separate file.
library(data.table)
# dummy data (with your real tibble, call setDT(df) to convert it in place)
df <- data.table(row_id = 1:123)
setDT(df)
# parameters
x <- nrow(df)  # number of rows
y <- 4         # number of splits
# create batch number (ceiling() keeps batch sizes as even as possible)
df[, batch := rep(1:y, each = ceiling(x / y), length.out = x)]
# split into a list of data.tables, one per batch
df <- split(df, by = 'batch')
# save each batch as a separate csv, named after its batch number
lapply(df, \(i) fwrite(i, file = paste0(i$batch[1], '.csv')))
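If even split() is too heavy (it briefly holds the list of batches alongside df), a minimal alternative sketch is to write each batch directly from a row-index range, so nothing beyond df and a small index vector is kept in memory. The file names here are illustrative:
library(data.table)
df <- data.table(row_id = 1:123)  # stand-in for the real 206M-row table
n <- nrow(df)
y <- 4                  # number of files
size <- ceiling(n / y)  # rows per file
for (b in seq_len(y)) {
  rows <- ((b - 1) * size + 1):min(b * size, n)
  fwrite(df[rows], file = paste0("chunk_", b, ".csv"))
}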
CodePudding user response:
Apologies if this solution misses the mark, but I believe the below should work:
library(dplyr)
library(readr)
df %>%                     # name of dataframe
  slice(1:5e7) %>%         # first 50M rows
  write_csv("file_a.csv")  # save as csv
and repeat for the remaining sets, changing the row range in slice() and the file name in write_csv().
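The repetition can be scripted; a sketch under the same assumptions (dplyr and readr loaded, ~50M rows per file with the remainder in the last one; the file names are illustrative):
library(dplyr)
library(readr)
n <- nrow(df)
size <- 5e7
starts <- seq(1, n, by = size)
for (k in seq_along(starts)) {
  df %>%
    slice(starts[k]:min(starts[k] + size - 1, n)) %>%
    write_csv(paste0("file_", letters[k], ".csv"))
}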