I am trying to run a PySpark job on YARN with the spark.shuffle.service.enabled=true option, but the job never completes:
Without the option, the job works well:
user@e7524bf7f996:~$ pyspark --master yarn
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0004).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
45
With the option --conf spark.shuffle.service.enabled=true:
user@e7524bf7f996:~$ pyspark --master yarn --conf spark.shuffle.service.enabled=true
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0005).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
2022-02-15 15:10:14,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:29,590 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:44,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Are there other options in Spark or YARN that should be enabled to make spark.shuffle.service.enabled work?
I am running Spark 3.1.2, Python 3.9.7, and Hadoop 3.2.1.
Thank you,
Bertrand
CodePudding user response:
You need to configure the external shuffle service on the YARN cluster as follows:
- Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
- Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.
- Add this jar to the classpath of all NodeManagers in your cluster.
- In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService (a minimal sketch follows this list).
- Increase the NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.
- Restart all NodeManagers in your cluster.
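For the yarn-site.xml step, a minimal sketch of the two properties (the names and class come from the Spark docs; I am assuming mapreduce_shuffle was already configured, so keep any existing services in the comma-separated list):

<!-- yarn-site.xml: register Spark's external shuffle service with each NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <!-- keep existing services (e.g. mapreduce_shuffle) alongside spark_shuffle -->
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>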
For details, please refer to https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
If it is still not working, check the following:
- Check the YARN UI to ensure enough resources are available.
- Try --deploy-mode cluster to ensure the driver can communicate with the YARN cluster for scheduling (see the sketch below).
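For example, a sketch of both checks (job.py is a hypothetical script name; note that the interactive pyspark shell itself only supports client deploy mode, so cluster mode needs a submitted script):

# List NodeManagers and their state to confirm resources are registered
yarn node -list -all
# Submit a script in cluster deploy mode instead of the interactive shell
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.shuffle.service.enabled=true \
  job.py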
CodePudding user response:
Thanks Warren for your help.
Here is the setup working for me:
https://github.com/BertrandBrelier/SparkYarn/blob/main/yarn-site.xml
echo "export YARN_HEAPSIZE=2000" >> /home/user/hadoop-3.2.1/etc/hadoop/yarn-env.sh
ln -s /home/user/spark-3.1.2-bin-hadoop3.2/yarn/spark-3.1.2-yarn-shuffle.jar /home/user/hadoop-3.2.1/share/hadoop/yarn/lib/.
echo "spark.shuffle.service.enabled true" >> /home/user/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf
Then I restarted Hadoop and Spark.
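For reference, a restart along these lines should do it, assuming the stock sbin scripts shipped with the Hadoop tarball at the paths used above:

# Restart YARN so the NodeManagers pick up the new aux-service and classpath
/home/user/hadoop-3.2.1/sbin/stop-yarn.sh
/home/user/hadoop-3.2.1/sbin/start-yarn.sh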
I was able to start a pyspark session:
pyspark --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true
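As a quick sanity check, the toy job from the question should now return a result instead of hanging:

>>> sc.parallelize(range(10)).sum()
45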