How to distribute data into X partitions on read with Spark?


I’m trying to read data from Hive with a Spark DataFrame and distribute it into a specific, configurable number of partitions (in proportion to the number of cores). My job is pretty straightforward and does not contain any joins or aggregations. I’ve read about the spark.sql.shuffle.partitions property, but the documentation says:

Configures the number of partitions to use when shuffling data for joins or aggregations.

Does this mean it would be irrelevant for me to configure this property? Or is the read operation itself considered a shuffle? If not, what is the alternative? repartition and coalesce seem a bit like overkill for this.
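
For context, a minimal sketch of what I mean (the table name and the target of 64 partitions are placeholders, not values from my actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-read")
      .enableHiveSupport()
      .getOrCreate()

    // Plain read from Hive, no joins or aggregations.
    val df = spark.table("my_db.my_table")
    println(df.rdd.getNumPartitions)   // whatever Spark decided on read

    // What I would like to avoid, if there is a read-time setting instead:
    val resized = df.repartition(64)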

CodePudding user response:

To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.

In this case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum amount of data that can be packed into a single partition (see https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html). The default value is 128 MB and can be overridden to improve parallelism.
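
As a rough sketch of how this could look (the 32 MB value and the table name are illustrative placeholders, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("tuned-read")
      .enableHiveSupport()
      // Cap each input partition at ~32 MB instead of the 128 MB default,
      // so the same input is split into roughly four times as many partitions.
      .config("spark.sql.files.maxPartitionBytes", (32 * 1024 * 1024).toString)
      .getOrCreate()

    val df = spark.table("my_db.my_table")   // placeholder table name
    println(df.rdd.getNumPartitions)

Note that this setting applies to Spark's file-based readers, so whether it affects a Hive table read depends on the table's storage format and on whether Spark converts the metastore table to its native datasource readers.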

CodePudding user response:

A read is not a shuffle as such; the data has to be brought in at some stage.

You can use the other answer's approach, or let Spark's own algorithm set the number of partitions on the read.

You do not say whether you are using an RDD or a DataFrame. With an RDD you can set the number of partitions at read time; with a DataFrame you generally need to repartition after the read, as sketched below.
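
A rough illustration of both cases (the path, table name and partition counts are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // RDD: the partition count can be requested directly at read time.
    val rdd = spark.sparkContext.textFile("/data/input", minPartitions = 200)

    // DataFrame: read first, then repartition to the desired count,
    // e.g. the default parallelism (roughly the number of available cores).
    val df = spark.table("my_db.my_table")
    val resized = df.repartition(spark.sparkContext.defaultParallelism)
    println(resized.rdd.getNumPartitions)

repartition(n) does trigger a shuffle, but when an explicit number is passed it determines the resulting partition count directly, independent of spark.sql.shuffle.partitions.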

As you note, spark.sql.shuffle.partitions only comes into play when joining or aggregating, so controlling parallelism through it is less relevant to your case.
