We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package, even on a node that has been idling in a Pool, still takes 90 seconds.
Some of these jobs are very long-running, so we would like to use Jobs Compute clusters for the lower cost in DBUs.
Some of these jobs are much shorter-running (<10 seconds), where the 90-second install time is much more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs. We would like to avoid the extra cost of All-Purpose Compute if possible.
Reading the Databricks documentation suggests that the idle instances in the Pool are reserved for us but do not cost us DBUs. Is there a way for us to pre-install the required libraries on our idle instances so that when a job comes through we are able to start processing it immediately?
Is there an alternate approach that can fulfill a similar use case?
CodePudding user response:
You can't install libraries directly onto nodes from a pool, because the actual code is executed in a Docker container corresponding to the Databricks Runtime. There are several ways to speed up installation of the libraries:
- Create your own Docker image with all necessary libraries pre-installed, and pre-load the Databricks Runtime version and your Docker image onto the pool. This part can't be done via the UI, so you need to use the REST API (see the description of the `preloaded_docker_images` attribute), databricks-cli, or the Databricks Terraform provider; a sketch of the API call is shown after this list. The main disadvantage of custom Docker images is that some functionality isn't available out of the box, for example arbitrary files in Repos, the web terminal, etc. (I don't remember the full list).
- Put all necessary libraries and their dependencies onto DBFS and install them via a cluster init script. It's very important that you collect binary dependencies, not packages with only the source code, so you won't need to compile them during installation. This collection step only needs to be done once (example sketches are shown after this list):
  - for Python this can be done with `pip download --prefer-binary lib1 lib2 ...`
  - for Java/Scala you can use `mvn dependency:get -Dartifact=<maven_coordinates>`, which will download the dependencies into the `~/.m2/repository` folder, from which you can copy the jars to DBFS and, in the init script, use the `cp /dbfs/.../jars/* /databricks/jars/` command
  - for R, it's slightly more complicated, but it is also doable
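For the first option, here is a minimal sketch of creating a pool that preloads a custom image via the Instance Pools REST API. The workspace URL, pool name, node type, runtime version, registry URL and credentials below are placeholders, not values from your environment:

```bash
# Create a pool that keeps idle nodes with both the runtime and the custom
# Docker image already pulled. All values are placeholders for illustration.
curl -X POST "https://<workspace-url>/api/2.0/instance-pools/create" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "instance_pool_name": "prewarmed-pool",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,
    "preloaded_spark_versions": ["11.3.x-scala2.12"],
    "preloaded_docker_images": [
      {
        "url": "myregistry.example.com/myorg/dbr-with-deps:latest",
        "basic_auth": {
          "username": "<registry-user>",
          "password": "<registry-password>"
        }
      }
    ]
  }'
```

The job clusters that draw from this pool then need to reference the same image in their `docker_image` setting, otherwise the preloaded image isn't actually used.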
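For the Python part of the second option, a rough sketch with made-up library names and a made-up DBFS folder. The one-time download should run on a machine (or cluster) matching the target platform and Python version so the binary wheels are compatible:

```bash
# One-time step, e.g. in a %sh notebook cell: fetch binary wheels for the
# libraries and all their dependencies into a DBFS folder.
pip download --prefer-binary --dest /dbfs/init/wheels mylib1 mylib2
```

A cluster init script can then install entirely from that folder, with no PyPI access and no compilation:

```bash
#!/bin/bash
# Example init script, e.g. stored at dbfs:/init/install-libs.sh and attached
# to the job clusters. Paths and library names are placeholders.
set -e
/databricks/python/bin/pip install --no-index \
  --find-links=/dbfs/init/wheels mylib1 mylib2
```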
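For Java/Scala dependencies, the one-time collection could look roughly like this, assuming a machine with Maven and the databricks CLI configured; the coordinates and DBFS folder are made up:

```bash
# Resolve a library and its dependencies into the local ~/.m2/repository.
mvn dependency:get -Dartifact=org.example:some-lib:1.2.3

# Flatten the downloaded jars into one folder and upload them to DBFS.
# Note: this picks up every jar in the local repository, so use a clean
# repository or filter the list if you only want this library's dependencies.
mkdir -p /tmp/job-jars
find ~/.m2/repository -name '*.jar' -exec cp {} /tmp/job-jars/ \;
databricks fs cp --recursive /tmp/job-jars dbfs:/init/jars
```

The init script then copies them onto the classpath, e.g. `cp /dbfs/init/jars/*.jar /databricks/jars/`.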