I have some scheduled data pipelines that are orchestrated via Azure Data Factory, each with a Databricks activity that runs on a job cluster.
All my Databricks activities are stuck in retry loops and failing with the following error:
Databricks execution failed with error state: InternalError, error message: Unexpected failure while waiting for the cluster <cluster-id> to be ready.Cause Cluster <cluster-id> is unusable since the driver is unhealthy.
My Databricks cluster is not even starting up.
This issue is quite similar to what has been posted here:
AWS Databricks cluster start failure
However, there are a few differences:
- My pipelines are running on Azure: Azure Data Factory and Azure Databricks
- I can spin up my interactive clusters (in the same workspace) without any problem
- I have checked with my colleagues who run similar pipelines on different subscriptions (in the same region), and they are not facing this issue
Any idea what is going on here? Is it just a service interruption of some sort, or is there something I can do to resolve this?
CodePudding user response:
It turns out that my pipelines were failing because the init script configured for our clusters was not executing correctly.
We maintain an internal Python package in Azure Artifacts. Installing it requires an Azure DevOps token, and the install command runs in the cluster init script. Because the token had expired, the init script was failing.
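For context, the failing piece was essentially a pip install against the private feed. A minimal sketch of what such an init script can look like is below; the feed URL, package name, environment variable, and pip path are placeholders and assumptions, not our exact setup:

```bash
#!/bin/bash
# Sketch of an init script that installs an internal package from an
# Azure Artifacts feed. The org, feed, and package name are placeholders.
set -e

# The Azure DevOps token is expected in an environment variable on the cluster
# (e.g. populated from a Databricks secret); if it has expired, the pip call
# below fails and the cluster never becomes healthy.
PAT="${AZURE_ARTIFACTS_PAT:?Azure DevOps token not set}"

# pip authenticates against the private feed using the token as the password.
/databricks/python/bin/pip install \
  --index-url "https://build:${PAT}@pkgs.dev.azure.com/<org>/_packaging/<feed>/pypi/simple/" \
  some-internal-package
```

Sourcing the token from a secret rather than hard-coding it in the script at least makes rotating it a configuration change rather than a script change.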
As a result, the cluster could not start up properly. The error message is quite cryptic, though: "Cause Cluster is unusable since the driver is unhealthy" could literally mean anything.
Still, if you come across this yourself, check your init script first.
Note: Another hint was that, when we looked through the cluster's Event log, the time between the INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED events was much longer than it should have taken.
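The Event log in the workspace UI is enough to spot this, but the same events can also be pulled programmatically. A rough sketch using the Clusters API 2.0 events endpoint is below; the workspace URL, token variable, and cluster id are placeholders, and jq is only used for readability:

```bash
# Fetch recent events for the failing job cluster and print their timestamps
# and types, so the gap between the init-script events can be checked.
# Timestamps are epoch milliseconds.
curl -s -X POST "https://<workspace-url>/api/2.0/clusters/events" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -d '{"cluster_id": "<cluster-id>", "limit": 50}' \
  | jq '.events[] | {timestamp, type}'
```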