executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(57)) - Driver commanded a shutdown - ho


I'm getting these logs from the executor (read from the bottom up):

2021-11-30 18:44:42,911 INFO  [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)) - Deleting directory /var/data/spark-0646270c-a2d0-47d4-8e6c-0bc735bc255d/spark-a54cf7e4-baaf-4411-9073-0c1fb1e4cc5b
2021-11-30 18:44:42,910 INFO  [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)) - Shutdown hook called
2021-11-30 18:44:42,902 ERROR [SIGTERM handler] executor.CoarseGrainedExecutorBackend (SignalUtils.scala:$anonfun$registerLogger$2(43)) - RECEIVED SIGNAL TERM
2021-11-30 18:44:42,823 INFO  [CoarseGrainedExecutorBackend-stop-executor] storage.BlockManager (Logging.scala:logInfo(57)) - BlockManager stopped
2021-11-30 18:44:42,822 INFO  [CoarseGrainedExecutorBackend-stop-executor] memory.MemoryStore (Logging.scala:logInfo(57)) - MemoryStore cleared
2021-11-30 18:44:42,798 INFO  [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(57)) - Driver commanded a shutdown

How can I enable logging on the Spark Driver so I can understand what event on the driver triggered the executor shutdown? There is no shortage of memory on the Driver or Executor; the pod metrics show their usage stays well within the configured limits and overhead. So it looks like the shutdown signal isn't caused by a lack of resources, but maybe by some hidden exception that isn't logged anywhere.
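One way to surface the driver's reasoning (a sketch, assuming Spark's default log4j 1.x setup; the file path and class names may differ in your image) is to raise the log level for the scheduler and allocation classes in `log4j.properties` on the driver pod:

```
# log4j.properties on the driver (location depends on your Spark image)
log4j.rootCategory=INFO, console

# Scheduler decisions, including killing/releasing executors
log4j.logger.org.apache.spark.scheduler=DEBUG

# Dynamic allocation decisions (idle-timeout removals, etc.)
log4j.logger.org.apache.spark.ExecutorAllocationManager=DEBUG
```

With these loggers at DEBUG, the driver log should show why it decided to tell an executor to shut down.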

Following the advice of @mazaneicha, I tried setting longer timeouts, but I still get the same error:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark: SparkSession = SparkSession
  .builder
  .master("local[1]")
  .config(new SparkConf().setIfMissing("spark.master", "local[1]")
    .set("spark.eventLog.dir", "file:///tmp/spark-events")
    .set("spark.dynamicAllocation.executorIdleTimeout", "100s")
    .set("spark.dynamicAllocation.schedulerBacklogTimeout", "100s")
  )
  .getOrCreate()

CodePudding user response:

The reason for the failure was actually in the logs:

2021-12-01 15:05:46,906 WARN  [main] streaming.StreamingQueryManager (Logging.scala:logWarning(69)) - Stopping existing streaming query [id=b13a69d7-5a2f-461e-91a7-a9138c4aa716, runId=9cb31852-d276-42d8-ade6-9839fa97f85c], as a new run is being started.

Why were the queries stopped? Because in Scala I was creating streaming queries in a loop over a collection, while keeping all the query names and all the checkpoint locations the same. After making them unique (I just used the string values from the collection), the failure went away.
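A minimal sketch of the fix (the collection, source, and sink here are illustrative, not from the original job): each query gets a name and a checkpoint location derived from the element it was created for, so no new run ever collides with an existing one.

```scala
// Hypothetical collection driving the loop
val topics = Seq("orders", "payments", "shipments")

topics.foreach { topic =>
  spark.readStream
    .format("kafka")                                   // assumed source
    .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
    .option("subscribe", topic)
    .load()
    .writeStream
    .queryName(s"query-$topic")                          // unique per element
    .option("checkpointLocation", s"/chk/$topic")        // unique per element
    .format("console")                                   // assumed sink
    .start()
}

// With identical names/checkpoints, StreamingQueryManager stops the
// previous query on each iteration:
// "Stopping existing streaming query ..., as a new run is being started."
```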
