I found that by default polars' output Parquet files are around 35% larger than Parquet files output by Spark (on the same data). Spark uses snappy compression by default, and switching ParquetCompression to snappy in polars does not close the gap. Is this because polars uses a more conservative compression setting? Is there any way to control the compression level of Parquet files in polars? From the polars docs, it seems that only Zstd accepts a ZstdLevel (and I am not even sure whether that is a compression level).
Below is my code to write a DataFrame to a Parquet file using snappy compression.
use polars::prelude::*;
use std::{fs::File, io::BufWriter};

let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
pw.finish(&mut df).expect("Unable to write j.parquet!"); // df: DataFrame
CodePudding user response:
This is not (yet) possible in rust polars. It will likely land in the next release of arrow2, and then we can implement it in polars as well. If you want that functionality in python polars, you can leverage pyarrow for this purpose. polars has zero-copy interop with pyarrow.
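For example, a minimal sketch of that route (assuming python polars and pyarrow are installed; the file name and data are placeholders):

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [1, 2, 3]})  # placeholder data

# df.to_arrow() hands the data to pyarrow without copying it;
# pyarrow's Parquet writer exposes per-codec compression options.
table = df.to_arrow()
pq.write_table(table, "j.parquet", compression="snappy")

# Codecs that support levels (e.g. zstd) also accept compression_level:
# pq.write_table(table, "j.parquet", compression="zstd", compression_level=10)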