I am trying to run the spark-submit command on my Hadoop cluster. Here is a summary of my Hadoop cluster:
- The cluster is built using 5 VirtualBox VMs connected on an internal network.
- There are 1 namenode and 4 datanodes.
- All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
I am trying to run one of the Spark examples using the following spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
- Environment variables: HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file; SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh (see the sketch of these settings after this list).
- Added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ in spark-defaults.conf.
- I have uploaded the jars into HDFS (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/).
- The logs accessible via the web interface (i.e. http://hadoop-namenode:8042) do not provide any further details about the error.
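For reference, the settings described above look roughly like this (a sketch only; the /opt/bitnami/... install paths and the $(hadoop classpath) line are illustrative assumptions for the Bitnami image, not exact copies of my files):

```
# ~/.bashrc (install locations assumed for the Bitnami image)
export HADOOP_HOME=/opt/bitnami/hadoop
export SPARK_HOME=/opt/bitnami/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# $SPARK_HOME/conf/spark-env.sh
export HADOOP_CONF_DIR=/opt/bitnami/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# $SPARK_HOME/conf/spark-defaults.conf
spark.master           yarn
spark.yarn.stagingDir  hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging
spark.yarn.jars        hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
```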
CodePudding user response:
This section of the Spark documentation seems relevant to the error, since the YARN libraries should be included by default, but only if you've installed the appropriate Spark distribution:
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn's classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn's classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
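If that turns out to be the cause, one quick way to test it (a sketch, not a guaranteed fix) is to set that property explicitly for a single run, reusing the example command from the question:

```
spark-submit --class org.apache.spark.examples.SparkPi \
  --conf spark.yarn.populateHadoopClasspath=true \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
```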
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths on each of the ResourceManager servers where JARs are available for all YARN applications (spark.yarn.jars or extra packages should get layered onto this).
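For illustration, such an entry might look like the following; the values are assumptions modelled on the usual yarn-default.xml defaults and should be adjusted to where the Hadoop jars actually live on your nodes:

```
<!-- yarn-site.xml: illustrative values only; check yarn-default.xml for your distribution -->
<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
  </value>
</property>
```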
Another problem could be file permissions. You probably shouldn't put Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put them under hdfs:///apps/spark/<version>, then give that directory 744 HDFS permissions.
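A rough sketch of that layout (the path is the suggestion above, the version is taken from the example jar in the question, and 744 is used as suggested; adjust to your own policy):

```
# create a shared location for the Spark jars and copy them in
hdfs dfs -mkdir -p /apps/spark/3.0.3
hdfs dfs -put $SPARK_HOME/jars/* /apps/spark/3.0.3/
# 744 as suggested above; use 755 instead if other users need to list the directory
hdfs dfs -chmod -R 744 /apps/spark/3.0.3
```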
The Spark / YARN UI should also show the complete classpath of the application for further debugging.
CodePudding user response:
I figured out why I was getting this error. It turns out that I made an error while specifying spark.yarn.jars in spark-defaults.conf. The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
Basically, we need to specify the jar files as the value of this property, not the folder containing the jar files.
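In other words, the working spark-defaults.conf entry looks like this (the hostname and path are specific to my cluster):

```
# spark-defaults.conf -- note the trailing /* so the jar files themselves are matched
spark.yarn.jars  hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
```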