On what basis does Spark create the initial partitions of data when reading a large CSV file?
How does Spark decide how many partitions/splits of the file's data to distribute across the worker nodes when reading a big CSV file?
Can anyone share how this is done?
CodePudding user response:
When reading non-bucketed HDFS files (e.g. Parquet) with Spark SQL, the number of DataFrame partitions (df.rdd.getNumPartitions)
depends on these factors (a minimal sketch after the list shows how to set them and inspect the result):
spark.default.parallelism (roughly translates to #cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
spark.sql.files.minPartitionNum (optional, introduced in Spark 3.1)
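As a minimal sketch (assuming a local Spark 3.x session and a placeholder file path), you can set these options explicitly and then check how many partitions the scan produced:

import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    // local[4] -> 4 cores, so spark.default.parallelism is roughly 4 here
    val spark = SparkSession.builder()
      .appName("csv-partition-check")
      .master("local[4]")
      .config("spark.sql.files.maxPartitionBytes", "134217728") // 128 MB (default)
      .config("spark.sql.files.openCostInBytes", "4194304")     // 4 MB (default)
      // .config("spark.sql.files.minPartitionNum", "8")        // optional, Spark 3.1+
      .getOrCreate()

    // Placeholder path; point this at your large CSV file
    val df = spark.read
      .option("header", "true")
      .csv("/path/to/big_file.csv")

    // Number of partitions Spark chose for reading the file
    println(s"partitions = ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}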
A rough estimation of the number of partitions is:
PartitionSize ≈ min(maxPartitionBytes, max(openCostInBytes, TotalDataSize / #cores))
NumberOfPartitions ≈ max(TotalDataSize / PartitionSize, minPartitionNum)
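A small self-contained sketch of this estimate (the function and its parameter names are illustrative, not Spark's API; the real logic lives in FilePartition as noted below):

object PartitionEstimate {
  // Rough estimate mirroring the formula above
  def estimateNumPartitions(
      totalDataSize: Long,                          // total bytes to read
      cores: Int,                                   // ~ spark.default.parallelism
      maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes
      openCostInBytes: Long = 4L * 1024 * 1024,     // spark.sql.files.openCostInBytes
      minPartitionNum: Option[Int] = None           // spark.sql.files.minPartitionNum
  ): Long = {
    val partitionSize =
      math.min(maxPartitionBytes, math.max(openCostInBytes, totalDataSize / cores))
    val estimated = math.ceil(totalDataSize.toDouble / partitionSize).toLong
    math.max(estimated, minPartitionNum.getOrElse(0).toLong)
  }

  def main(args: Array[String]): Unit = {
    // Example: a 10 GB file read by an application with 8 cores
    // 10 GB / 8 cores = 1.28 GB > 128 MB, so PartitionSize is capped at 128 MB
    val tenGb = 10L * 1024 * 1024 * 1024
    println(estimateNumPartitions(tenGb, cores = 8)) // ~80 partitions of ~128 MB each
  }
}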
You can refer to FilePartition for the exact calculation.