I am exploring spark job recovery mechanism and I have a queries related to it,
- How spark recovers from driver node failure
- recovery from executor node failures
- what are the ways to handle such scenarios ?
CodePudding user response:
- Driver node Failure:: If driver node which is running our spark Application is down, then Spark Session details will be lost and all the executors with their in-memory data will get lost. If we restart our application, getorCreate() method will reinitialize spark sesssion from the checkpoint directory and resume processing.
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so we need to monitor it using a tool like monit and restart it. The best way to do this is probably specific to environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting driver that lets Spark restart it. We will also need to pass --deploy-mode cluster to make the driver run within the cluster and not on your local machine like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Imp Point: When the driver crashes, executors in Spark will also restart.
- Executor Node Failure: Any of the worker nodes running executor can fail, thus resulting in loss of in-memory.
For failure of a executor node, Spark uses the same techniques as Spark for its fault tolerance. All the data received from external sources is replicated among the worker nodes. All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
I hope I covered third question in the above points itself