Home > Enterprise >  pyspark partitioning create an extra empty file for every partition
pyspark partitioning create an extra empty file for every partition

Time:01-24

I am facing with one problem in Azure Databricks. In my notebook I am executing simple write command with partitioning:

df.write.format('parquet').partitionBy("startYear").save(output_path,header=True)

And I see something like this: enter image description here

Can someone explain why spark is creating this additional empty files for every partition and how to disable it?

I tried different mode for write, different partitioning and spark versions

CodePudding user response:

I reproduced the above and got the same results when I use Blob Storage.

enter image description here

Can someone explain why spark is creating this additional empty files for every partition and how to disable it?

Spark won't create these types of files. Blob Storage creates the blobs like above when we create parquet files by partitions.

We cannot avoid these if you use Blob Storage. You can avoid it by using ADLS Storage.

These are my Results with ADLS:

enter image description here

  • Related