I'm working on a Scala Spark project where we load data from a file into PostgreSQL. It runs fine locally in standalone mode with a small test dataset, using a JDBC write.
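For context, a stripped-down version of what the local write looks like (the input path, JDBC URL, table name, and credentials here are just placeholders):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("file-to-postgres")
      .getOrCreate()

    // Placeholder input: a CSV with a header row
    val df = spark.read.option("header", "true").csv("/data/input.csv")

    df.write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb") // placeholder URL
      .option("dbtable", "public.my_table")                   // placeholder table
      .option("user", "postgres")                             // placeholder credentials
      .option("password", "secret")
      .save()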
But since the production data is huge, I want to use a cluster with multiple workers and one logical processor core per executor.
With that in mind, how do I partition the data across all available cores in the cluster?
Thanks!
PS: Using Scala 2.13.9 and Spark 3.3.0
CodePudding user response:
If you are using dynamic allocation and your cluster is shared by concurrent jobs, it may be hard to get the number of partitions exactly equal to the number of cores your job can use, because you won't know that number upfront and you can't calculate it dynamically.
You may try to figure out some arbitrary number and set the numPartitions JDBC parameter to the number of partitions you want to use on write. With this parameter Spark will coalesce the dataset down to at most that many partitions before writing, so you end up with at most numPartitions write tasks. Remember that each task writing in parallel holds one JDBC connection, so be careful not to overwhelm your PostgreSQL instance.
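For example, a write capped at 16 concurrent connections could look roughly like this, assuming your loaded DataFrame is called df (the URL, table name, credentials, and the value 16 are placeholders to tune against what your PostgreSQL instance can handle):

    import org.apache.spark.sql.SaveMode

    // Cap the write at 16 concurrent JDBC connections.
    // If df has more than 16 partitions, Spark coalesces it down to 16 before writing.
    df.write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
      .option("dbtable", "public.my_table")                 // placeholder table
      .option("user", "postgres")                           // placeholder credentials
      .option("password", "secret")
      .option("numPartitions", "16")
      .save()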
From the Spark JDBC data source documentation, the numPartitions option (read/write, default: none):

"The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing."
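If you do want roughly one partition per available core, one option is to repartition explicitly before writing. A sketch, assuming a fixed (non-dynamic) allocation: in standalone mode sparkContext.defaultParallelism typically reflects the total number of cores across all executors, so it can serve as a starting point (connection details below are again placeholders):

    // Roughly one partition per available core in a standalone cluster.
    val targetPartitions = spark.sparkContext.defaultParallelism

    df.repartition(targetPartitions) // full shuffle up or down to the target count
      .write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
      .option("dbtable", "public.my_table")                 // placeholder table
      .option("user", "postgres")                           // placeholder credentials
      .option("password", "secret")
      .save()

Each of those partitions opens its own connection, so on a large cluster you may still want to cap the write with numPartitions as shown above.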