How to properly structure Spark code in Databricks?


I have a relatively big project in Azure Databricks that will soon go to production. The code is currently organized in a few folders in a repository, and the tasks are triggered from ADF on job clusters that execute notebooks one after another. The notebooks have some hardcoded values like the input path, output path, etc.

I don't think it is the best approach.

I would like to get rid of hardcoded values and rely on some environment variables/environment file/environment class or something like that.

I was thinking about creating a few classes whose methods hold the individual transformations, with the save operations kept outside of the transformations. Can you give me some tips? How do I reference one Scala script from another in Databricks? Should I create a JAR?

Or can you refer me to some documentation/good public repositories where I can see how it should be done?

CodePudding user response:

It's hard to write a comprehensive guide on how to go to production, but here are some things I wish I had known earlier.

When going to production:

  1. Try to migrate to JAR jobs once you have a well-established flow.
    Notebooks are meant for exploratory work and are not recommended for long-running jobs.
    You can pass parameters to your main, read environment variables, or read the Spark config - it's up to you how you pass the configuration (see the sketch after this list).
  2. Choose a New Job Cluster and avoid All-Purpose Clusters.
    In production, Databricks recommends using new clusters so that each task runs in a fully isolated environment. The pricing is also different for a New Job Cluster; I would say it ends up cheaper.
  3. Use Databricks secret scopes (dbutils.secrets) to deal with secrets instead of hardcoding them.
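
As a rough illustration of point 1, here is a minimal sketch of what a JAR-job entry point could look like in Scala. The object name, argument layout, and config keys (env, spark.myapp.input.path, etc.) are invented for the example; adapt them to however your ADF pipeline or job definition passes configuration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point for a Databricks JAR job.
// The job definition passes parameters as plain program arguments,
// e.g. ["--env", "prod", "--input", "abfss://raw@myaccount.dfs.core.windows.net/events"].
object IngestJob {
  def main(args: Array[String]): Unit = {
    // Naive parsing of "--key value" pairs, kept short for the example.
    val params = args.sliding(2, 2).collect { case Array(k, v) => k.stripPrefix("--") -> v }.toMap

    // Option A: program arguments; Option B: environment variables set on the cluster.
    val env = params.getOrElse("env", sys.env.getOrElse("DEPLOY_ENV", "dev"))

    val spark = SparkSession.builder().appName(s"ingest-$env").getOrCreate()

    // Option C: custom keys placed in the Spark config of the job cluster.
    val inputPath  = params.getOrElse("input",  spark.conf.get("spark.myapp.input.path"))
    val outputPath = params.getOrElse("output", spark.conf.get("spark.myapp.output.path"))

    val raw = spark.read.format("delta").load(inputPath)
    // ...transformations live in separate, testable classes/objects...
    raw.write.format("delta").mode("overwrite").save(outputPath)
  }
}
```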

... and a few other off-topic ideas:

  1. I would recommend taking a look at CI/CD recipes for Jenkins
  2. Automate deployments with the Databricks CLI

CodePudding user response:

If you're using notebooks for your code, then it's better to split the code into the following pieces:

  1. Notebooks with "library functions" ("library notebooks") - these only define the functions that transform data. Such a function usually receives a DataFrame and some parameters, performs the transformation(s), and returns a new DataFrame. These functions shouldn't read or write data, or at least shouldn't have hardcoded paths.
  2. Notebooks that are the entry points of jobs (let's call them "main") - they may receive parameters via widgets; for example, you can pass the environment name (prod/dev/staging), file paths, etc. These "main" notebooks can include the "library notebooks" using %run with relative paths, e.g. %run ./Library1 or %run folder/Library2 (see doc). A minimal sketch of this split follows after the list.
  3. Notebooks that are used for testing - they also include the "library notebooks", but add code that calls the functions and checks the results. Usually you need specialized libraries, such as spark-testing-base (Scala & Python), chispa (Python only), spark-fast-tests (Scala only), etc., to compare the content of the DataFrames, schemas, and so on (here are examples of using different libraries). These test notebooks can be triggered either as regular jobs or from a CI/CD pipeline. For that you can use the Databricks CLI or the dbx tool (a wrapper around the Databricks CLI). I have a demo of a CI/CD pipeline with notebooks, although it's for Python.
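
To make the split between "library" and "main" notebooks concrete, here is a minimal Scala sketch. The notebook names, widget names, function, and paths are invented for the example, and the %run line is shown as a comment because in a real notebook it has to sit in its own cell:

```scala
// ---- Notebook: lib/Transformations (a "library notebook") ----
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Pure transformation: takes a DataFrame plus parameters, returns a new DataFrame.
// No reads, no writes, no hardcoded paths.
def withCleanedAmount(df: DataFrame, currencyCol: String): DataFrame =
  df.withColumn("amount", col("amount").cast("decimal(18,2)"))
    .filter(col(currencyCol).isNotNull)

// ---- Notebook: jobs/MainJob (an entry-point "main" notebook) ----
// %run ../lib/Transformations     <- brings the functions above into scope

// Parameters arrive via widgets (set by ADF or the job definition).
dbutils.widgets.text("env", "dev")
dbutils.widgets.text("inputPath", "")
dbutils.widgets.text("outputPath", "")

val env        = dbutils.widgets.get("env")   // e.g. could select dev/prod defaults
val inputPath  = dbutils.widgets.get("inputPath")
val outputPath = dbutils.widgets.get("outputPath")

val result = withCleanedAmount(spark.read.format("delta").load(inputPath), currencyCol = "currency")
result.write.format("delta").mode("overwrite").save(outputPath)
```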

For notebooks, it's recommended to use the Repos functionality, which lets you perform version-control operations on multiple notebooks at once.

Depending on the size of your code and how often it changes, you can also package it as a library that is attached to a cluster and used from the "main" notebooks. In that case it can be a bit easier to test the library functions - you can just use standard tooling such as Maven, SBT, etc. (a minimal test sketch follows).
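
For example, if the transformation code is packaged as a plain Scala library, a unit test can run against a local SparkSession with nothing Databricks-specific involved. The object and method names below are hypothetical (mirroring the earlier sketch), and the hand-rolled assertions could be replaced with spark-fast-tests or spark-testing-base comparisons:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class TransformationsSpec extends AnyFunSuite {

  // Local SparkSession so the test runs under SBT/Maven on CI, no cluster needed.
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("transformations-test")
    .getOrCreate()

  test("withCleanedAmount keeps only rows that have a currency") {
    import spark.implicits._

    val input = Seq(("10.50", "EUR"), ("3.00", null)).toDF("amount", "currency")

    val result = Transformations.withCleanedAmount(input, currencyCol = "currency")

    // For richer DataFrame/schema comparisons, use spark-fast-tests or spark-testing-base.
    assert(result.count() == 1)
    assert(result.schema("amount").dataType.typeName == "decimal(18,2)")
  }
}
```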

P.S. You can also reach out to the solutions architect assigned to your account (if there is one) and discuss this topic in more detail.
