Write python polars lazy_frame to csv gzip archive after collect()


What is the best way to write a gzip-compressed CSV in python polars?

This is my current implementation:

import polars as pl
import gzip

# create a lazy frame (collect() below requires a LazyFrame)
df = pl.DataFrame({
    "foo": [1, 2, 3, 4, 5],
    "bar": [6, 7, 8, 9, 10],
    "ham": ["a", "b", "c", "d", "e"]
}).lazy()

# collect dataframe to memory and write to gzip file
file_path = 'compressed_dataframe.gz'
with gzip.open(file_path, 'wb') as f:
    df.collect().write_csv(f)

CodePudding user response:

Your current implementation looks good. You are correctly using the gzip module to compress the output and polars to serialize the dataframe to CSV.

One thing to consider is that if your dataframe is very large, it can be more memory efficient to write the CSV to the gzip file in chunks rather than serializing the whole frame in a single call. You can do this by iterating over row slices with iter_slices() and writing each slice to the open file handle.

import polars as pl
import gzip

# create a dataframe
df = pl.DataFrame({
    "foo": [1, 2, 3, 4, 5],
    "bar": [6, 7, 8, 9, 10],
    "ham": ["a", "b", "c", "d", "e"]
})

# write dataframe to gzip file in row chunks
file_path = 'compressed_dataframe.gz'
with gzip.open(file_path, 'wb') as f:
    for i, chunk in enumerate(df.iter_slices(n_rows=1000)):
        # write the header only for the first chunk
        chunk.write_csv(f, include_header=(i == 0))

This writes the dataframe to the gzip file 1000 rows at a time, so only one slice's worth of CSV text is serialized in memory at once. You can adjust n_rows to a value that suits your use case.
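If you start from a LazyFrame and want to avoid collecting it into memory at all, another option is to stream the query result to an uncompressed CSV on disk and gzip it afterwards. This is only a minimal sketch, assuming a polars version that provides LazyFrame.sink_csv; the intermediate dataframe.csv path is hypothetical:

import gzip
import shutil
import polars as pl

lf = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}).lazy()

# sink_csv executes the lazy query and streams rows straight to disk
tmp_path = "dataframe.csv"  # hypothetical intermediate file
lf.sink_csv(tmp_path)

# compress the plain CSV into a gzip archive afterwards
with open(tmp_path, "rb") as src, gzip.open("compressed_dataframe.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

This also sidesteps the warning shown in the benchmark below, since polars writes to a path rather than a Python file object.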

CodePudding user response:

The below is a further test of the gzip method. Polars recommends passing a path instead of a Python file object for better performance. While the gzip write was very costly at roughly 60 s, the gzip read took only about 4.5 s, compared to 0.9 s for the regular CSV.

import polars as pl
import gzip
from datetime import datetime
import os

import numpy as np

# Create a large dataframe with 10 million rows and 5 columns
rows = 10000000
df = pl.DataFrame({
    "col1": np.random.randint(0, 100, rows),
    "col2": np.random.randn(rows),
    "col3": np.random.choice(["apple", "banana", "orange"], rows),
    "col4": np.random.randint(0, 100, rows),
    "col5": np.random.randn(rows)
}).lazy()

# output paths for the plain csv and the gzip archive
file_path = 'compressed_dataframe.gz'
comparison = 'comparison.csv'

start = datetime.now()
df.collect().write_csv(comparison)
print(f"comparison saved at {datetime.now()-start}")

start = datetime.now()
pl.read_csv(comparison)
print(f"comparison read at {datetime.now()-start}")

start = datetime.now()
with gzip.open(file_path, 'wb') as f:
    df.collect().write_csv(f)
print(f"gz saved at {datetime.now()-start}")

start = datetime.now()
with gzip.open(file_path, 'rb') as f:
    print(pl.read_csv(f))
print(f"gz read at {datetime.now()-start}")


print(f"{comparison} has size of {os.path.getsize(comparison)}")

print(f"{file_path} has size of {os.path.getsize(file_path)}")

Output:

comparison saved at 0:00:01.505098
comparison read at 0:00:00.899469
gz saved at 0:00:57.891954
Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
shape: (10000000, 5)
gz read at 0:00:04.539318
comparison.csv has size of 517290342
compressed_dataframe.gz has size of 224153795

The compression ratio is roughly 2.3.
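For reference, that ratio is just the plain CSV size divided by the gzip size; a quick sketch using the file names from the benchmark above:

import os

plain = os.path.getsize("comparison.csv")
compressed = os.path.getsize("compressed_dataframe.gz")

# 517290342 / 224153795 ≈ 2.3 for the run above
print(f"compression ratio is {plain / compressed:.2f}")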
