I'm working with a Parquet file in Python with pandas and I ran into a surprise.
The Parquet file I read weighs 290 MB, but after filtering records out of the dataframe and saving it with df.to_parquet(), the new Parquet file weighs 390 MB.
How can that be?
Here is my code example:

import pandas as pd

df = pd.read_parquet('path/to/file.parquet')
ids_to_remove = ['id1', 'id2']

def filterout(df):
    # keep only the rows whose id is not in the removal list
    return df[~df.id.isin(ids_to_remove)]

filtered_df = filterout(df)
filtered_df.to_parquet('path/to/save/filtered.parquet')
CodePudding user response:
That can happen when the filtering doesn't actually shrink the data much but changes some cells to a "bigger" dtype. For example, an integer 5 could have been upcast to the float 5.0.
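One way to check for that, assuming both files are on disk at the (hypothetical) paths from the question, is to compare the column dtypes of the two dataframes side by side:

import pandas as pd

original_df = pd.read_parquet('path/to/file.parquet')
filtered_df = pd.read_parquet('path/to/save/filtered.parquet')

# Put both dtype listings side by side; any row where they differ is a suspect
comparison = pd.DataFrame({'original': original_df.dtypes,
                           'filtered': filtered_df.dtypes})
print(comparison[comparison['original'] != comparison['filtered']])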
CodePudding user response:
So this is what I found...
Doing df.info() on both dataframes, I realized they had different index dtypes, and that was causing the size difference: the original dataframe had a RangeIndex, which Parquet stores as just start/stop/step metadata, while the filtered one had an Int64Index, which gets written out as a full 64-bit integer column, one value per row.
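For illustration, this is roughly what the two indexes looked like (a minimal sketch using the df and filtered_df from the question; the exact repr depends on your pandas version):

print(df.index)           # RangeIndex(start=0, stop=N, step=1): stored as lightweight metadata
print(filtered_df.index)  # Int64Index([...]) with gaps: written to the file as a 64-bit column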
How did I solve it?
Simply by doing df = df.reset_index(drop=True)
before saving the data.
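Putting it all together, the corrected version of the code from the question:

import pandas as pd

df = pd.read_parquet('path/to/file.parquet')
ids_to_remove = ['id1', 'id2']

filtered_df = df[~df.id.isin(ids_to_remove)]

# reset_index(drop=True) replaces the gappy Int64Index left behind by the
# boolean filter with a fresh RangeIndex, which pandas/pyarrow store as
# metadata rather than as an extra 64-bit column in the Parquet file
filtered_df = filtered_df.reset_index(drop=True)

filtered_df.to_parquet('path/to/save/filtered.parquet')

Alternatively, if you don't need to preserve the index at all, passing index=False to to_parquet() skips writing it entirely.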
I discovered this thanks to this entry: python - Dataframes with RangeIndex vs. Int64Index - Why?