Do we need to install Spark on YARN to read data from HDFS into PySpark?


I have a Hadoop 3.1.1 multi-node cluster. I want to use PySpark to read files from HDFS, run ETL operations on them, and then load the results into a target MySQL database.

Specifically, my questions are:

  • Can I install Spark in standalone mode?
  • Do I need to install Spark on YARN first?
  • If not, how can I install Spark separately?

CodePudding user response:

You can use any mode to communicate with HDFS and MySQL, including Kubernetes. Or you can just use --master="local[*]" and you don't need a cluster scheduler at all. This is useful, for example, from a Jupyter notebook; see the sketch below.
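For the local[*] case, a minimal sketch of the HDFS-to-MySQL flow could look like this. The namenode address, MySQL host, credentials, and table names are placeholder assumptions, and the JDBC driver coordinates may need adjusting for your environment:

```python
from pyspark.sql import SparkSession

# Local-mode session: no cluster scheduler required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("hdfs-to-mysql-etl")
    # MySQL JDBC driver must be on the classpath; version is an assumption.
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28")
    .getOrCreate()
)

# Read source files directly from HDFS (namenode host/port are placeholders).
df = spark.read.option("header", "true").csv("hdfs://namenode:8020/data/input/")

# ... ETL transformations on df go here ...

# Write the result to the target MySQL database over JDBC
# (host, database, table, and credentials below are hypothetical).
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://mysql-host:3306/targetdb")
   .option("dbtable", "etl_output")
   .option("user", "etl_user")
   .option("password", "secret")
   .mode("append")
   .save())

spark.stop()
```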

YARN would be recommended, since you already have HDFS and therefore already have the scripts to start the YARN processes as well.

You don't really "install Spark on YARN". Client applications are submitted to the YARN cluster, and the HDFS archive referenced by spark.yarn.archive (or the jars listed in spark.yarn.jars) is unpacked in the YARN containers to provide the classes needed to run the job.
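For illustration, a client-side PySpark session submitted to YARN might look like the sketch below. The HADOOP_CONF_DIR and spark.yarn.archive paths are assumptions; the archive would be a zip you build yourself from the jars in your Spark distribution:

```python
import os
from pyspark.sql import SparkSession

# The client machine only needs Spark installed locally plus the Hadoop/YARN
# configuration directory; YARN itself has no Spark "installed" on it.
os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")  # assumed path

spark = (
    SparkSession.builder
    .master("yarn")  # submit to the YARN ResourceManager from the cluster config
    .appName("hdfs-to-mysql-etl")
    # Optional: a pre-staged zip of Spark jars on HDFS, unpacked per container.
    # The path is hypothetical; create the zip from $SPARK_HOME/jars.
    .config("spark.yarn.archive", "hdfs:///apps/spark/spark-jars.zip")
    .getOrCreate()
)

# Quick smoke test: read a file from HDFS and count its lines.
df = spark.read.text("hdfs:///data/input/sample.txt")
print(df.count())
spark.stop()
```

If spark.yarn.archive (and spark.yarn.jars) is not set, Spark zips and uploads the jars under $SPARK_HOME on each submission; that still works, but pre-staging the archive on HDFS avoids the repeated upload.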

Refer to https://spark.apache.org/docs/latest/running-on-yarn.html
