I am trying to run the spark-submit command on my Hadoop cluster.
Here is a summary of my Hadoop Cluster:
- The cluster is built from 5 VirtualBox VMs connected on an internal network
- There is 1 namenode and there are 4 datanodes
- All the VMs were built from the Bitnami Hadoop Stack VirtualBox image
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a SparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following:
- The environment variables HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file
- SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
- Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ in spark-defaults.conf (sketched below)
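For reference, a sketch of the corresponding config entries (the spark-defaults.conf values are exactly those listed above; the spark-env.sh values are illustrative placeholders, not necessarily the Bitnami defaults):

# spark-defaults.conf
spark.master            yarn
spark.yarn.stagingDir   file:///home/bitnami/sparkStaging
spark.yarn.jars         file:///opt/bitnami/hadoop/spark/jars/

# spark-env.sh
export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop     # actual path depends on the Bitnami layout
export SPARK_DIST_CLASSPATH=$(hadoop classpath)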
CodePudding user response:
Since the Spark job is supposed to be submitted to the Hadoop cluster managed by YARN, master and deploy-mode have to be set. From the Spark 3.3.0 docs:
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Or programmatically:
spark = SparkSession.builder.appName('appName').master('yarn').config('spark.submit.deployMode', 'cluster').getOrCreate()
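Adapted to the example jar from the question (same jar path and argument), the submission would look something like this:

export HADOOP_CONF_DIR=/path/to/hadoop/conf   # wherever the cluster's Hadoop config lives
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar \
  10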
CodePudding user response:
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the "YARN staging directory" needs to be available on all Spark executors, not just a local file path on the machine where you run spark-submit.
The path that isn't found is being reported from the YARN cluster, where /home/bitnami
might not exist, or the Unix user running the Spark executor containers does not have access to that path.
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.
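As a rough sketch (the HDFS directory names here are only examples), the jars could be staged in HDFS once and both settings pointed there:

# copy the Spark jars into HDFS
hdfs dfs -mkdir -p /user/bitnami/sparkStaging
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put /opt/bitnami/hadoop/spark/jars/* /spark/jars/

# spark-defaults.conf
spark.yarn.stagingDir   hdfs:///user/bitnami/sparkStaging
spark.yarn.jars         hdfs:///spark/jars/*

Alternatively, removing the spark.yarn.stagingDir override entirely lets Spark fall back to its default, the submitting user's home directory in HDFS.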