I'm working with a Parquet file in Python with pandas and I ran into a surprise.
The Parquet file I read weighs 290 MB, but after filtering records out of the dataframe and saving it with df.to_parquet(), the new Parquet file weighs 390 MB.
How can that be?
Here is my code example:

import pandas as pd

df = pd.read_parquet('path/to/file.parquet')
ids_to_remove = ['id1', 'id2']

def filterout(df):
    # keep only the rows whose id is not in the removal list
    return df[~df.id.isin(ids_to_remove)]

filtered_df = filterout(df)
filtered_df.to_parquet('path/to/save/filtered.parquet')
CodePudding user response:
That can happen when the filtering doesn't actually shrink the data much but changes some cells to a "bigger" dtype. For example, an integer 5 could have been upcast to the float 5.0.
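One way to check for that, assuming both files are on disk at the (hypothetical) paths from the question, is to compare the column dtypes of the two dataframes side by side:

import pandas as pd

original_df = pd.read_parquet('path/to/file.parquet')
filtered_df = pd.read_parquet('path/to/save/filtered.parquet')

# Put both dtype listings side by side; any row where they differ is a suspect
comparison = pd.DataFrame({'original': original_df.dtypes,
                           'filtered': filtered_df.dtypes})
print(comparison[comparison['original'] != comparison['filtered']])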
CodePudding user response:
So this is what I found...
Doing df.info() on both dataframes, I realized they had different index dtypes, and that was causing the size difference: the original dataframe had a RangeIndex, which Parquet stores as just start/stop/step metadata, while the filtered one had an Int64Index, which gets written out as a full 64-bit integer column, one value per row.
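For illustration, this is roughly what the two indexes looked like (a minimal sketch using the df and filtered_df from the question; the exact repr depends on your pandas version):

print(df.index)           # RangeIndex(start=0, stop=N, step=1): stored as lightweight metadata
print(filtered_df.index)  # Int64Index([...]) with gaps: written to the file as a 64-bit column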
How did I solve it?
Simply by doing df = df.reset_index(drop=True)
before saving the data.
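Putting it all together, the corrected version of the code from the question:

import pandas as pd

df = pd.read_parquet('path/to/file.parquet')
ids_to_remove = ['id1', 'id2']

filtered_df = df[~df.id.isin(ids_to_remove)]

# reset_index(drop=True) replaces the gappy Int64Index left behind by the
# boolean filter with a fresh RangeIndex, which pandas/pyarrow store as
# metadata rather than as an extra 64-bit column in the Parquet file
filtered_df = filtered_df.reset_index(drop=True)

filtered_df.to_parquet('path/to/save/filtered.parquet')

Alternatively, if you don't need to preserve the index at all, passing index=False to to_parquet() skips writing it entirely.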
I discovered this thanks to this entry: python - Dataframes with RangeIndex vs. Int64Index - Why?