I have the following simplified code:
import socket
from pyspark import TaskContext

def f(_):
    # Report which partition ran on which host (one record per partition).
    partition_id = TaskContext.get().partitionId()
    yield partition_id, socket.gethostname()

rdd.partitionBy(...).mapPartitions(f).collect()
The problem is that sometimes the same worker gets two partitions while another worker sits idle (I'm working with three worker nodes). For example, the code above prints:
1, "Computer 1"
2, "Computer 1"
Is there a way to ensure that every worker processes at least one partition (if there are enough partitions)? I don't care which node processes which partition; I just want to make sure parallelism is maximized.
CodePudding user response:
That's expected. Spark does "its best" to distribute the data evenly among the workers, but it doesn't always do what we expect. Try this example:
rdd = sc.parallelize([
    ('a', 1),
    ('a', 2),
    ('b', 3),
    ('b', 4),
    ('c', 5),
])

def print_partitions(lst):
    for i, a in enumerate(lst):
        print(i, a)

for i in range(1, 20):
    print(f'test {i} partitions:')
    print_partitions(rdd.partitionBy(i).glom().collect())
    print('---\n')
The more partitions you make, the higher the chance that your data ends up distributed evenly (i.e. all the 'a' records in one partition, etc.).
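Applied to the original question, a rough sketch (assuming rdd is a pair RDD and that there are 3 worker nodes; the partition count 12 is an arbitrary example) could look like this:

import socket
from pyspark import TaskContext

def tag_partition(_):
    # Emit one record per partition: (partition id, hostname of the executor).
    yield TaskContext.get().partitionId(), socket.gethostname()

# Using noticeably more partitions than workers (e.g. 3 workers -> 12 partitions)
# raises the odds that every worker ends up with at least one partition.
for pid, host in sorted(rdd.partitionBy(12).mapPartitions(tag_partition).collect()):
    print(pid, host)

This still doesn't force one partition per worker; it only tilts the scheduler's odds in your favor.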
CodePudding user response:
In general, no, you can't guarantee this.
You can encourage it by waiting for data locality, increasing data replication, or increasing the number of partitions.
Technically, if you replicated your data to every node, you would always have a node that holds your data locally, but that's expensive space-wise and not good practice unless you really need it.
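To make those suggestions concrete, here is a rough sketch of the knobs involved (the locality wait value, the partition count, and the storage level are illustrative choices, not recommendations):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Ask the scheduler to wait longer for a data-local executor before
# falling back to whichever executor happens to be free.
spark = (SparkSession.builder
         .config("spark.locality.wait", "10s")
         .getOrCreate())
sc = spark.sparkContext

# More partitions than workers makes it likelier that every worker gets work.
rdd = sc.parallelize(range(100), numSlices=12)

# Keep two in-memory replicas of each partition so more nodes hold the data.
rdd = rdd.persist(StorageLevel.MEMORY_ONLY_2)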