I have a big Feather file which I want to convert to Parquet so that I can work with PySpark. Is there a more efficient way of changing the file type than doing the following:
import pandas as pd

df = pd.read_feather('file.feather').set_index('date')
df_parquet = df.astype(str)
df_parquet.to_parquet("path/file.gzip", compression='gzip')
Since the DataFrame df kills my memory, I'm looking for alternatives. From this post I understand that I can't read Feather files from PySpark directly.
CodePudding user response:
With the code you posted, you are doing the following conversions (mapped back onto your snippet right after this list):
- Load the data from disk into RAM; Feather files are already in the Arrow format.
- Convert the data from Arrow into a pandas DataFrame.
- Convert the DataFrame from pandas back into Arrow.
- Serialize the Arrow data to Parquet.
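As a commented restatement of your own code (nothing new functionally), the steps line up like this:
import pandas as pd

# Steps 1 and 2: read_feather loads the Arrow data from disk and
# immediately converts it into a pandas DataFrame
df = pd.read_feather('file.feather').set_index('date')

# Steps 3 and 4: to_parquet converts the pandas DataFrame back into
# Arrow and then serializes the result to Parquet
df.astype(str).to_parquet("path/file.gzip", compression='gzip')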
Steps 2-4 are each expensive. You will not be able to avoid step 4, but by keeping the data in Arrow and skipping the round trip through pandas, you can avoid steps 2 and 3 with the following code snippet:
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Read the Feather file directly into an Arrow table (no pandas involved)
table = feather.read_table("file.feather")
# Serialize the Arrow table to Parquet
pq.write_table(table, "path/file.parquet")
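Since the goal is to work with PySpark, the resulting file can then be loaded with Spark's built-in Parquet reader. A minimal sketch, assuming a local SparkSession and the path used above:
from pyspark.sql import SparkSession

# Assumes Spark is available locally; the app name is arbitrary
spark = SparkSession.builder.appName("feather-to-parquet").getOrCreate()

# Spark reads Parquet natively, so no pandas or pyarrow is needed here
sdf = spark.read.parquet("path/file.parquet")
sdf.printSchema()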
A minor issue, but you should avoid using the .gzip ending for Parquet files. A .gzip / .gz ending indicates that the whole file is compressed with gzip and can be unzipped with gunzip. That is not the case with gzip-compressed Parquet files: the Parquet format compresses individual segments and leaves the metadata uncompressed, which gives nearly the same compression ratio at a much higher compression speed. The compression algorithm is thus an implementation detail and is not exposed to other tools.
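If you still want gzip compression for the Parquet output, you can pass the codec to write_table while keeping the .parquet extension; a small sketch based on the snippet above (the path is illustrative):
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = feather.read_table("file.feather")
# The codec is recorded in the file's metadata, so the extension stays .parquet
pq.write_table(table, "path/file.parquet", compression="gzip")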