I'm a bit confused about how the master and worker roles are assigned to the connected machines (VMs) on the network when Spark runs in cluster mode.
I have two nodes. On one of them (which I consider the principal node) I do the Hadoop configuration: mapred-site.xml, core-site.xml, hdfs-site.xml, hadoop-env.sh and the workers file, plus the YARN-related configuration files (I chose YARN as the resource manager in my case). In the workers file under the main Hadoop folder I list the worker IPs. I then replicate the whole Hadoop folder onto the second node and set the Hadoop path there.
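For reference, a minimal sketch of what the relevant parts of those files might look like; the property names are standard Hadoop/YARN settings, while the hostnames master-node and worker-node are placeholders for my two machines:

    $HADOOP_HOME/etc/hadoop/workers        (one worker host or IP per line)
        worker-node

    $HADOOP_HOME/etc/hadoop/yarn-site.xml  (tells every node where the ResourceManager runs)
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>master-node</value>
        </property>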
My question is: when I launch a Spark job (using spark-submit), what is the workflow that is responsible for assigning a master node and the worker nodes?
In a basic setup without Hadoop I would specify the master and the workers explicitly, by launching start-master.sh on one machine and start-slave.sh on each of the others. But how do Spark / Hadoop assign the master and worker nodes when everything goes through the Hadoop configuration files?
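For comparison, this is roughly what I mean by the standalone (non-Hadoop) case, assuming the default master port 7077 and the placeholder hostname master-node:

    # on the master machine
    $SPARK_HOME/sbin/start-master.sh

    # on each worker machine, pointing at the master's URL
    $SPARK_HOME/sbin/start-slave.sh spark://master-node:7077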
Thanks!
CodePudding user response:
The driver requests containers from YARN, and the executors are launched inside those containers to do the work. YARN takes care of the allocation for you, so you don't need to worry about where the master (driver) / slave (executor) processes are placed.
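A minimal sketch of a submission that hands placement over to YARN (the class name, jar and executor count are placeholders, not values from your setup):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 2 \
      --class com.example.MyApp \
      my-app.jar

With --deploy-mode cluster the driver itself also runs inside a YARN container on one of the NodeManagers; with --deploy-mode client it stays on the machine where you ran spark-submit, and only the executors are placed by YARN.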