I am trying to run a PySpark job on YARN with the spark.shuffle.service.enabled=true option, but the job never completes:
Without the option, the job works well:
user@e7524bf7f996:~$ pyspark --master yarn
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0004).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
45
With the option --conf spark.shuffle.service.enabled=true:
user@e7524bf7f996:~$ pyspark --master yarn --conf spark.shuffle.service.enabled=true
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0005).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
2022-02-15 15:10:14,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:29,590 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:44,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Are there other options in Spark or YARN that should be enabled to make spark.shuffle.service.enabled work?
I am running Spark 3.1.2, Python 3.9.7, and Hadoop 3.2.1.
Thank you,
Bertrand
CodePudding user response:
You need to configure the external shuffle service on the YARN cluster as follows:
- Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
- Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.
- Add this jar to the classpath of all NodeManagers in your cluster.
- In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService (a minimal sketch follows this list).
- Increase the NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.
- Restart all NodeManagers in your cluster.
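For the yarn-site.xml step, a minimal sketch of the two properties (the names and class come from the Spark docs; I am assuming mapreduce_shuffle was already configured, so keep any existing services in the comma-separated list):

<!-- yarn-site.xml: register Spark's external shuffle service with each NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <!-- keep existing services (e.g. mapreduce_shuffle) alongside spark_shuffle -->
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>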
For details, please refer to https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
If it is still not working, check the following:
- Check the YARN UI to ensure enough resources are available.
- Try --deploy-mode cluster to ensure the driver can communicate with the YARN cluster for scheduling (see the sketch below).
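For example, a sketch of both checks (job.py is a hypothetical script name; note that the interactive pyspark shell itself only supports client deploy mode, so cluster mode needs a submitted script):

# List NodeManagers and their state to confirm resources are registered
yarn node -list -all
# Submit a script in cluster deploy mode instead of the interactive shell
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.shuffle.service.enabled=true \
  job.py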
CodePudding user response:
Thanks Warren for your help.
Here is the setup working for me:
https://github.com/BertrandBrelier/SparkYarn/blob/main/yarn-site.xml
echo "export YARN_HEAPSIZE=2000" >> /home/user/hadoop-3.2.1/etc/hadoop/yarn-env.sh
ln -s /home/user/spark-3.1.2-bin-hadoop3.2/yarn/spark-3.1.2-yarn-shuffle.jar /home/user/hadoop-3.2.1/share/hadoop/yarn/lib/.
echo "spark.shuffle.service.enabled true" >> /home/user/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf
Then I restarted Hadoop and Spark.
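For reference, a restart along these lines should do it, assuming the stock sbin scripts shipped with the Hadoop tarball at the paths used above:

# Restart YARN so the NodeManagers pick up the new aux-service and classpath
/home/user/hadoop-3.2.1/sbin/stop-yarn.sh
/home/user/hadoop-3.2.1/sbin/start-yarn.sh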
I was able to start a pyspark session:
pyspark --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true
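As a quick sanity check, the toy job from the question should now return a result instead of hanging:

>>> sc.parallelize(range(10)).sum()
45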