I am trying to run the spark-submit command on my Hadoop cluster. Here is a summary of my Hadoop cluster:
- The cluster is built using 5 VirtualBox VMs connected on an internal network.
- There are 1 namenode and 4 datanodes.
- All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
I am trying to run one of the Spark examples using the following spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
- Environment variables: HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file; SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh (see the sketch of these settings after this list).
- Added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ in spark-defaults.conf.
- I have uploaded the jars into HDFS (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/).
- The logs accessible via the web interface (i.e. http://hadoop-namenode:8042) do not provide any further details about the error.
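For reference, the settings described above look roughly like this (a sketch only; the /opt/bitnami/... install paths and the $(hadoop classpath) line are illustrative assumptions for the Bitnami image, not exact copies of my files):

```
# ~/.bashrc (install locations assumed for the Bitnami image)
export HADOOP_HOME=/opt/bitnami/hadoop
export SPARK_HOME=/opt/bitnami/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# $SPARK_HOME/conf/spark-env.sh
export HADOOP_CONF_DIR=/opt/bitnami/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# $SPARK_HOME/conf/spark-defaults.conf
spark.master           yarn
spark.yarn.stagingDir  hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging
spark.yarn.jars        hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
```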
CodePudding user response:
This section of the Spark documentation seems relevant to the error, since the YARN libraries should be included by default, but only if you've installed the appropriate Spark distribution:
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn's classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn's classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
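If that turns out to be the cause, one quick way to test it (a sketch, not a guaranteed fix) is to set that property explicitly for a single run, reusing the example command from the question:

```
spark-submit --class org.apache.spark.examples.SparkPi \
  --conf spark.yarn.populateHadoopClasspath=true \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
```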
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths on each of the ResourceManager servers where JARs are available for all YARN applications (spark.yarn.jars or extra packages should get layered onto this).
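For illustration, such an entry might look like the following; the values are assumptions modelled on the usual yarn-default.xml defaults and should be adjusted to where the Hadoop jars actually live on your nodes:

```
<!-- yarn-site.xml: illustrative values only; check yarn-default.xml for your distribution -->
<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
  </value>
</property>
```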
Another problem could be file permissions. You probably shouldn't put Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put them under hdfs:///apps/spark/<version>, then give that directory 744 HDFS permissions.
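A rough sketch of that layout (the path is the suggestion above, the version is taken from the example jar in the question, and 744 is used as suggested; adjust to your own policy):

```
# create a shared location for the Spark jars and copy them in
hdfs dfs -mkdir -p /apps/spark/3.0.3
hdfs dfs -put $SPARK_HOME/jars/* /apps/spark/3.0.3/
# 744 as suggested above; use 755 instead if other users need to list the directory
hdfs dfs -chmod -R 744 /apps/spark/3.0.3
```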
The Spark / YARN UI should also show the complete classpath of the application for further debugging.
CodePudding user response:
I figured out why I was getting this error. It turns out that I made an error while specifying spark.yarn.jars in spark-defaults.conf. The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
Basically, we need to specify the jar files as the value of this property, not the folder containing the jar files.
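In other words, the working spark-defaults.conf entry looks like this (the hostname and path are specific to my cluster):

```
# spark-defaults.conf -- note the trailing /* so the jar files themselves are matched
spark.yarn.jars  hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
```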