Dask .repartition(partition_size="100MB") is not respecting given size


I'm converting pandas dataframes into parquet files. For this I'm using Dask to partition the generated files.

import dask.dataframe

my_dask_df = dask.dataframe.from_pandas(my_pandas_df, npartitions=1)
my_dask_df = my_dask_df.repartition(partition_size="100MB")
dask.dataframe.to_parquet(
    my_dask_df,
    destination_dir,
    name_function=lambda i: f"{dataset_name}-{dataset_version}-part-{i}.parquet",
)

As you can see in the snippet, I'm specifying my_dask_df.repartition(partition_size="100MB"). My expectation is that this should create partitions of around 100 MB.

Now, I've had a few datasets that are below 100 MB, or maybe just over. The partitioned files turn out to be only about 5 MB each, so I end up with n files of roughly 5 MB.

Why is this?

CodePudding user response:

The parquet file format compresses your data by default. The partition_size argument refers to the size of the data in memory, whereas you are looking at the size on disk.
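To see the gap directly, you can compare the in-memory size of each partition with the size of the written files. A minimal sketch, reusing my_dask_df and destination_dir from the question:

import os

# In-memory bytes per partition (computed with pandas inside each partition).
in_memory_bytes = my_dask_df.map_partitions(
    lambda pdf: pdf.memory_usage(deep=True).sum()
).compute()
print(in_memory_bytes)

# On-disk bytes of each written parquet file, for comparison.
for fname in sorted(os.listdir(destination_dir)):
    if fname.endswith(".parquet"):
        print(fname, os.path.getsize(os.path.join(destination_dir, fname)))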

The default compression algorithm right now is python-snappy. So the final size of your written file will depend on whether snappy can find patterns in your data which allows it to reduce the stored information. For example, if your data were all zeros, the resulting file would be very small, even though in memory an array of zeros is the same size as an array of random large values.
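If you want to confirm that compression is what's shrinking the files, you can write the same dataframe once with the default snappy codec and once uncompressed and compare the totals. A small sketch; the output directories "parquet_snappy" and "parquet_uncompressed" are just placeholders, and compression=None assumes your parquet engine (pyarrow or fastparquet) accepts it:

import os
import dask.dataframe

def dir_size(path):
    # Total bytes of the parquet files written into a directory.
    return sum(
        os.path.getsize(os.path.join(path, f))
        for f in os.listdir(path)
        if f.endswith(".parquet")
    )

dask.dataframe.to_parquet(my_dask_df, "parquet_snappy", compression="snappy")
dask.dataframe.to_parquet(my_dask_df, "parquet_uncompressed", compression=None)
print("snappy:      ", dir_size("parquet_snappy"))
print("uncompressed:", dir_size("parquet_uncompressed"))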

For what it’s worth, Dask’s repartition argument is not always followed exactly: each partition should be roughly 100 MB in memory, with the exception of the last one, which will often be smaller. But the behavior you’re seeing is definitely related to the on-disk compression.
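If what you actually care about is the on-disk size, one possible workaround (not a built-in Dask feature, just a sketch with a hypothetical estimate_disk_ratio helper and a "trial_output" placeholder directory) is to measure the compression ratio from a trial write and scale partition_size accordingly:

import os
import dask.dataframe

def estimate_disk_ratio(ddf, trial_dir):
    # Hypothetical helper: write once, then compare on-disk bytes to in-memory bytes.
    dask.dataframe.to_parquet(ddf, trial_dir)
    on_disk = sum(
        os.path.getsize(os.path.join(trial_dir, f))
        for f in os.listdir(trial_dir)
        if f.endswith(".parquet")
    )
    in_memory = (
        ddf.map_partitions(lambda pdf: pdf.memory_usage(deep=True).sum())
        .compute()
        .sum()
    )
    return on_disk / in_memory

ratio = estimate_disk_ratio(my_dask_df, "trial_output")
target_disk_bytes = 100 * 10**6  # aim for roughly 100 MB per file on disk
# partition_size also accepts a byte count, so scale the in-memory target by the ratio.
my_dask_df = my_dask_df.repartition(partition_size=int(target_disk_bytes / ratio))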

See the dask.dataframe.to_parquet docs for more info.
