Is the number of partitions of a parquet file in ADLS same as number of partitions after reading it

Time:07-19

I have three parquet datasets in ADLS.

Two of them contain 10 part files each, and when I read either one as a DataFrame in Databricks using PySpark, the number of partitions is 10, which is the expected behaviour.

The third contains 172 snappy.parquet part files, but when I read it as a DataFrame the number of partitions is 89. What is the reason behind this?

I used df.rdd.getNumPartitions() to find the number of partitions of each DataFrame.

CodePudding user response:

When reading, Spark tries to create partitions no larger than the size specified by spark.sql.files.maxPartitionBytes (128 MB by default). It takes each file's size into account: when files are smaller than the desired partition size, a single partition is assembled from multiple files, and when a file is larger than the desired partition size, it is split across multiple partitions (provided the format is splittable, as Parquet is).

In your case, it looks like many of your 172 files are smaller than the desired partition size, so Spark coalesces several of them into each partition, giving you 89 partitions instead of 172.
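To make the mechanism concrete, here is a simplified pure-Python sketch of this packing logic. It is modelled loosely on Spark's file-partition planning, not a copy of it: the real implementation differs in details, and the constants below (128 MB max partition size, a 4 MB per-file "open cost", a default parallelism of 8) are assumed defaults for illustration.

```python
# Simplified sketch of how Spark packs input files into read partitions.
# Assumed defaults (mirroring typical Spark settings, for illustration only):
#   spark.sql.files.maxPartitionBytes = 128 MB
#   spark.sql.files.openCostInBytes  = 4 MB

MB = 1024 * 1024

def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost=4 * MB, default_parallelism=8):
    """Target split size: capped at maxPartitionBytes, but shrunk when the
    total input is small relative to the assumed parallelism."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def count_partitions(file_sizes, **kwargs):
    """Estimate how many read partitions result from the given file sizes."""
    target = max_split_bytes(file_sizes, **kwargs)
    open_cost = kwargs.get("open_cost", 4 * MB)
    # Parquet is splittable, so files larger than the target are cut up.
    splits = []
    for size in file_sizes:
        while size > 0:
            chunk = min(size, target)
            splits.append(chunk)
            size -= chunk
    # Greedily pack splits (largest first), closing a partition when the
    # next split would push it past the target size.
    splits.sort(reverse=True)
    partitions, current = 0, 0
    for length in splits:
        if current + length > target and current > 0:
            partitions += 1
            current = 0
        current += length + open_cost  # each split also pays the "open cost"
    if current > 0:
        partitions += 1
    return partitions

# 172 small files (~3 MB each) get coalesced into far fewer partitions:
print(count_partitions([3 * MB] * 172))  # → 10 with these assumed defaults
```

With these numbers, 18 small files fit under the 128 MB target at a time, so 172 files collapse into about 10 partitions; your observed 89 simply reflects your actual file sizes and cluster settings plugged into the same kind of calculation.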
