I have code which uses write.csv to save a large number of files in bzip2 format. Here's a small reproducible example:
df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
write.csv(df, file = bzfile('df.csv.bzip2'))
I want to speed up the code. I know data.table::fwrite is much faster than write.csv, but I don't know how to get fwrite to save to csv.bzip2. I optimistically tried the code below, but the compression doesn't appear to be working: the file is 5.4 MB, versus 2.5 MB for the write.csv version saved above.
data.table::fwrite(df, 'df2.csv.bzip2')
Can anyone advise if it's possible to use fwrite to save a compressed csv in bzip2 format? If not, can anyone advise on an alternative way to save a csv via fwrite and then convert it to bzip2 format, e.g. something like the below? It's not essential to do the compression within fwrite; I just want to use fwrite to speed up the saving process and for the end product to be a properly compressed csv.bzip2 file.
data.table::fwrite(df, 'df2.csv') #saves a normal csv
# (add code here which converts the output of fwrite to a properly compressed csv.bzip2 file)
NB I'm aware I can save as gzip through fwrite, but I want the file to be in bzip2 format.
CodePudding user response:
If gzip instead of bzip2 solves the compression problem, just set the argument compress = "gzip".
data.table::fwrite(iris, '~/Temp/df2.gz')
file.size('~/Temp/df2.gz')
#> [1] 3867
data.table::fwrite(iris, '~/Temp/df2.gz', compress = 'gzip')
file.size('~/Temp/df2.gz')
#> [1] 874
Created on 2023-01-31 with reprex v2.0.2
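File size is a reasonable hint that compression happened, but a more direct check is to look at the first bytes of the file: gzip streams start with the bytes 1f 8b, and bzip2 streams start with the characters "BZh". The sketch below uses only base R; the helper name detect_compression is made up for illustration, and the demo files are written with write.csv rather than fwrite so that it runs without extra packages.

```r
# Hypothetical helper: guess the compression format from the file's magic bytes.
detect_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 3L)
  if (length(magic) >= 2L && identical(magic[1:2], as.raw(c(0x1f, 0x8b)))) {
    "gzip"
  } else if (length(magic) >= 3L && identical(magic, charToRaw("BZh"))) {
    "bzip2"
  } else {
    "none or unknown"
  }
}

# Demo files in a temporary location (paths are illustrative).
plain_path <- tempfile(fileext = ".csv")
gz_path    <- tempfile(fileext = ".csv.gz")
bz_path    <- tempfile(fileext = ".csv.bz2")

write.csv(iris, plain_path, row.names = FALSE)       # no compression
write.csv(iris, gzfile(gz_path), row.names = FALSE)  # gzip connection
write.csv(iris, bzfile(bz_path), row.names = FALSE)  # bzip2 connection

detect_compression(plain_path)
detect_compression(gz_path)
detect_compression(bz_path)
```

Running the same helper on a file written by fwrite should show whether it was actually compressed, regardless of the extension in its name.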
CodePudding user response:
You can use R.utils::bzip2 to compress the file afterwards.
df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
system.time(write.csv(df, file = bzfile("df.csv.bz2")))
# user system elapsed
# 0.912 0.005 0.917
system.time({data.table::fwrite(df, "df2.csv"); R.utils::bzip2("df2.csv")})
# user system elapsed
# 0.487 0.011 0.473
system.time(readr::write_csv(df, "df3.csv.bz2")) #Comment from @Ritchie Sacramento
# user system elapsed
# 0.743 0.042 0.988
file.size("df.csv.bz2")
#[1] 2511607
file.size("df2.csv.bz2")
#[1] 2232901
file.size("df3.csv.bz2")
#[1] 2431997
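If installing R.utils is not an option, the same "write first, compress afterwards" idea can probably be done with base R alone by streaming the csv through a bzfile connection. This is only a sketch: the function name bzip2_file and its chunk_bytes argument are invented here, it has not been benchmarked against R.utils::bzip2, and unlike that function it leaves the uncompressed csv in place. The demo falls back to write.csv when data.table is not installed.

```r
# Hypothetical helper: copy `src` into a bzip2-compressed file, chunk by chunk,
# so the whole csv never has to sit in memory at once.
bzip2_file <- function(src, dest = paste0(src, ".bz2"), chunk_bytes = 1e6) {
  input <- file(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- bzfile(dest, open = "wb")
  on.exit(close(output), add = TRUE)
  repeat {
    chunk <- readBin(input, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break  # end of file reached
    writeBin(chunk, output)
  }
  invisible(dest)
}

df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
csv_path <- tempfile(fileext = ".csv")

# Fast write with fwrite when available; otherwise plain write.csv.
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fwrite(df, csv_path)
} else {
  write.csv(df, csv_path, row.names = FALSE)
}

bz_path <- bzip2_file(csv_path)
file.size(csv_path)
file.size(bz_path)
```

The compressed file can be read back with read.csv(bzfile(bz_path)); call file.remove(csv_path) afterwards if the uncompressed copy is not needed.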