How to save a csv as bzip2 in R, either within fwrite or after saving the csv using fwrite

Time:01-31

I have code which uses write.csv to save a large number of files in bzip2 format. Here's a small reproducible example:

df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
write.csv(df, file = bzfile('df.csv.bzip2'))

I want to speed up the code. I know data.table::fwrite is much faster than write.csv, but I don't know how to get fwrite to save to csv.bzip2. I've optimistically tried the below, but the compression doesn't appear to be working, e.g. the file size is 5.4MB vs. 2.5MB from the write.csv version saved above.

data.table::fwrite(df, 'df2.csv.bzip2') 

Can anyone advise if it's possible to use fwrite to save a compressed csv in bzip2 format? If not, can anyone advise on an alternative way to save a csv via fwrite and then convert to bzip2 format? E.g. something like the below. It's not essential to do the compression within fwrite, I just want to use fwrite to speed up the saving process and for the end product to be a properly-compressed csv.bzip2 file.

data.table::fwrite(df, 'df2.csv') #saves a normal csv
# (add code here which converts the output of fwrite to a properly-compressed csv.bzip2 file)

NB I'm aware I can save as gzip through fwrite, but I want the file to be in bzip2 format.

CodePudding user response:

If gzip rather than bzip2 is acceptable for your compression problem, just set the argument compress = "gzip".

data.table::fwrite(iris, '~/Temp/df2.gz')
file.size('~/Temp/df2.gz')
#> [1] 3867

data.table::fwrite(iris, '~/Temp/df2.gz', compress = 'gzip')
file.size('~/Temp/df2.gz')
#> [1] 874

Created on 2023-01-31 with reprex v2.0.2

CodePudding user response:

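You can use R.utils::bzip2 to compress the file afterwards. By default it writes the result next to the input with a .bz2 extension (here df2.csv.bz2) and removes the uncompressed file.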

df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))

system.time(write.csv(df, file = bzfile("df.csv.bz2")))
#       user      system     elapsed
#      0.912       0.005       0.917 

system.time({data.table::fwrite(df, "df2.csv"); R.utils::bzip2("df2.csv")})
#       user      system     elapsed
#      0.487       0.011       0.473 

system.time(readr::write_csv(df, "df3.csv.bz2")) #Comment from @Ritchie Sacramento
#       user      system     elapsed
#      0.743       0.042       0.988 

file.size("df.csv.bz2")
#[1] 2511607

file.size("df2.csv.bz2")
#[1] 2232901

file.size("df3.csv.bz2")
#[1] 2431997
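
If you would rather not depend on R.utils, the same two-step idea (write with fwrite, then compress) can be sketched with base R connections only. The helper name fwrite_bz2 and its chunk_size parameter are made up for illustration and are not part of data.table; this is a minimal sketch that assumes data.table is installed, and it has not been benchmarked against the timings above.

```r
# Hypothetical helper (not part of data.table): write the CSV with fwrite
# to a temporary file, then stream it through a base-R bzfile() connection.
fwrite_bz2 <- function(x, path, chunk_size = 1e7, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  data.table::fwrite(x, tmp, ...)

  # after = FALSE makes the connections close before the temp file is deleted
  con_in <- file(tmp, open = "rb")
  on.exit(close(con_in), add = TRUE, after = FALSE)
  con_out <- bzfile(path, open = "wb")
  on.exit(close(con_out), add = TRUE, after = FALSE)

  # Copy in chunks so large files do not have to fit in memory at once
  repeat {
    chunk <- readBin(con_in, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    writeBin(chunk, con_out)
  }
  invisible(path)
}

# Usage: write, then read back through bzfile() to confirm the round trip
df <- data.frame(A = rnorm(1000), B = rnorm(1000), C = rnorm(1000))
out <- fwrite_bz2(df, file.path(tempdir(), "df.csv.bz2"))
back <- read.csv(bzfile(out))
```

Reading the file back with read.csv(bzfile(out)) is a quick way to check that the output really is bzip2-compressed rather than a plain csv with a misleading extension, which is what happened with the fwrite(df, 'df2.csv.bzip2') attempt in the question.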