I have created a 7-node cluster on Dataproc (1 master and 6 executors: 3 primary and 3 secondary preemptible). I can see in the console that the cluster was created correctly, and I have all 6 IPs and VM names. I am trying to test the cluster, but the code seems to run on at most 2 executors. Following is the code I am using to check which executors the code actually ran on:
import socket

# Collect the hostname of the executor that processed each element
set(sc.parallelize(range(1, 1000000)).map(lambda x: socket.gethostname()).collect())
output:
{'monsoon-testing-sw-543d', 'monsoon-testing-sw-p7w7'}
I have restarted the kernel many times; although the specific executors change, the number of executors the code runs on stays the same.
Can somebody help me understand what is going on here and why PySpark is not parallelizing my code across all the executors?
CodePudding user response:
You have many executors available, but not enough data partitions for them to work on. You can pass the numSlices parameter
to the parallelize() method to define how many partitions should be created:
rdd = sc.parallelize(range(1,1000000), numSlices=12)
The number of partitions should be at least equal to (ideally larger than) the number of executors for good work distribution.
Btw: with rdd.getNumPartitions()
you can check how many partitions your RDD has.
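Putting this together, here is a minimal sketch that repeats the hostname test with an explicit partition count. It assumes the same sc SparkContext and cluster from the question; the value 12 is just an illustrative choice of at least the number of worker nodes:

import socket

# Create the RDD with an explicit number of partitions (12 is an assumed value)
rdd = sc.parallelize(range(1, 1000000), numSlices=12)

# Confirm how many partitions the RDD actually has
print(rdd.getNumPartitions())

# Re-run the original test: with enough partitions, the tasks should
# be spread across more of the worker hosts
print(set(rdd.map(lambda x: socket.gethostname()).collect()))

Note that when numSlices is omitted, parallelize() falls back to sc.defaultParallelism, so it is also worth checking that value on your cluster to see why only two hosts showed up originally.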