At work I have a Scala Databricks notebook that imports many libraries, both from Maven and from some JAR files. The problem is that when I schedule jobs on this notebook, it sometimes fails (seemingly at random, roughly 1 run in 10) because the cells execute before all the libraries are installed. The job then fails and I have to go relaunch it manually. This behavior makes the notebook hard to use in production, since a scheduled job that fails intermittently can't be relied on.
I tried putting a Thread.sleep() of a minute or so before all my imports, but it does not change anything. For Python there is dbutils.library.installPyPI("library-name"), but there is no equivalent for Scala in the dbutils documentation.
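To illustrate the kind of workaround I mean, polling for a library class instead of sleeping blindly would look something like the sketch below (com.example.SomeClass is a placeholder for a class from one of my actual dependencies):

```scala
// Sketch: instead of a fixed sleep, poll until a class from a dependent
// library is actually loadable on the classpath.
// "com.example.SomeClass" is a placeholder; substitute a class from one
// of your real Maven/JAR dependencies.
def waitForLibrary(className: String, timeoutMs: Long = 300000L): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  var loaded = false
  while (!loaded && System.currentTimeMillis() < deadline) {
    try {
      Class.forName(className)
      loaded = true
    } catch {
      case _: ClassNotFoundException => Thread.sleep(5000) // retry every 5 seconds
    }
  }
  if (!loaded)
    throw new RuntimeException(s"$className still not loadable after ${timeoutMs / 1000}s")
}

waitForLibrary("com.example.SomeClass")
```

Note that this would have to run in a cell before any cell that does the imports, since Scala imports are resolved when the cell is compiled.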
Has anyone had the same issue, and if so, how did you solve it?
Thank you!
CodePudding user response:
Simply put: for production scheduled jobs, use a New Job Cluster and avoid an All-Purpose Cluster.
New Job Clusters are dedicated clusters that are created and started when you run a task and terminate immediately after the task completes. In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment.
In the UI, when setting up your notebook job, select a New Job Cluster and afterwards add all the dependent libraries to the job.
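For reference, the equivalent job definition through the Jobs API looks roughly like this (a sketch; the cluster settings, Maven coordinates, JAR path, and notebook path are placeholders to replace with your own):

```json
{
  "name": "my-notebook-job",
  "new_cluster": {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "libraries": [
    { "maven": { "coordinates": "org.example:some-library:1.2.3" } },
    { "jar": "dbfs:/FileStore/jars/my-custom-lib.jar" }
  ],
  "notebook_task": {
    "notebook_path": "/Users/me@example.com/MyNotebook"
  }
}
```

Because the libraries are declared on the job itself, they are installed on the fresh cluster before the notebook task starts, which removes the race you are seeing.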
The pricing is different for New Job Clusters; I would say it ends up cheaper.
Note: Use Databricks pools to reduce cluster start and auto-scaling times (if it's an issue to begin with).
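If you do use a pool, the new_cluster block from the spec above just references it by ID, something like this (the pool ID is a placeholder; note that node_type_id is omitted because the pool determines the node type):

```json
{
  "spark_version": "10.4.x-scala2.12",
  "instance_pool_id": "pool-0123456789abcdef",
  "num_workers": 2
}
```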