I have a big Feather file which I want to convert to Parquet so that I can work with PySpark. Is there a more efficient way of changing the file type than doing the following:
import pandas as pd

df = pd.read_feather('file.feather').set_index('date')
df_parquet = df.astype(str)
df_parquet.to_parquet("path/file.gzip", compression='gzip')
Since the DataFrame df kills my memory, I'm looking for alternatives. From this post I understand that I can't read Feather files from PySpark directly.
CodePudding user response:
With the code you posted, you are doing the following conversions (mapped back onto your snippet right after this list):
- Load the data from disk into RAM; Feather files are already in the Arrow format.
- Convert the data from Arrow into a pandas DataFrame.
- Convert the DataFrame from pandas back into Arrow.
- Serialize the Arrow data to Parquet.
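As a commented restatement of your own code (nothing new functionally), the steps line up like this:
import pandas as pd

# Steps 1 and 2: read_feather loads the Arrow data from disk and
# immediately converts it into a pandas DataFrame
df = pd.read_feather('file.feather').set_index('date')

# Steps 3 and 4: to_parquet converts the pandas DataFrame back into
# Arrow and then serializes the result to Parquet
df.astype(str).to_parquet("path/file.gzip", compression='gzip')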
Steps 2-4 are each expensive. You will not be able to avoid step 4, but by keeping the data in Arrow and skipping the round trip through pandas, you can avoid steps 2 and 3 with the following code snippet:
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Read the Feather file directly into an Arrow table (no pandas involved)
table = feather.read_table("file.feather")
# Serialize the Arrow table to Parquet
pq.write_table(table, "path/file.parquet")
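Since the goal is to work with PySpark, the resulting file can then be loaded with Spark's built-in Parquet reader. A minimal sketch, assuming a local SparkSession and the path used above:
from pyspark.sql import SparkSession

# Assumes Spark is available locally; the app name is arbitrary
spark = SparkSession.builder.appName("feather-to-parquet").getOrCreate()

# Spark reads Parquet natively, so no pandas or pyarrow is needed here
sdf = spark.read.parquet("path/file.parquet")
sdf.printSchema()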
A minor issue, but you should avoid using the .gzip ending for Parquet files. A .gzip / .gz ending indicates that the whole file is compressed with gzip and can be unzipped with gunzip. That is not the case with gzip-compressed Parquet files: the Parquet format compresses individual segments and leaves the metadata uncompressed, which gives nearly the same compression ratio at a much higher compression speed. The compression algorithm is thus an implementation detail and is not exposed to other tools.
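If you still want gzip compression for the Parquet output, you can pass the codec to write_table while keeping the .parquet extension; a small sketch based on the snippet above (the path is illustrative):
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = feather.read_table("file.feather")
# The codec is recorded in the file's metadata, so the extension stays .parquet
pq.write_table(table, "path/file.parquet", compression="gzip")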