Unit testing for functions defined in a Databricks notebook and unit testing for a PySpark DataFrame


I have defined a few functions and call them in other notebooks. I want to create a notebook that unit tests all of these functions from ADF, and I also need to do unit tests such as a count match between the source file and the data frame.

How can I achieve this?

CodePudding user response:

You can do unit testing of Databricks notebooks using Databricks Connect, a way of remotely executing code on your Databricks cluster.

Start by cloning the repository that goes along with the blog post linked in the Source at the end of this answer.

Now create a new virtual environment and run:

pip install -r requirements.txt

Then you’ll have to set up Databricks Connect. You can do this by running databricks-connect configure, as described in the linked blog post.

You can check that your Databricks Connect setup is working correctly by running:

databricks-connect test
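
Once that test passes, you can run unit tests locally with pytest while the Spark code executes on the remote cluster. Below is a minimal sketch of the count-match test from the question; the source path is a placeholder, and the check assumes one record per line in the file.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # With Databricks Connect configured, getOrCreate() returns a
    # session that runs on the remote Databricks cluster.
    return SparkSession.builder.getOrCreate()

def test_count_matches_source_file(spark):
    source_path = "/mnt/raw/customers.csv"  # placeholder path
    df = spark.read.option("header", "true").csv(source_path)
    # Counting raw lines and subtracting the header assumes one
    # record per line (no multiline quoted fields).
    expected = spark.read.text(source_path).count() - 1
    assert df.count() == expected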

Source: https://benalexkeen.com/unit-testing-with-databricks-part-1/

CodePudding user response:

There are two things here:

  • Unit testing - this could be done just by running .count() on the data frame and wrapping it in an assert, or by using one of the specialized unit testing frameworks for Spark - for example, chispa or spark-testing-base (you can find examples for them in my repo). A chispa sketch is shown after this list.

  • Execution of the tests - there are different approaches for that:

    • You can just execute the code from the notebook as a job and throw an exception if a test fails, similar to what is described in the documentation.
    • Use the Nutter library to trigger execution of the tests in one or more notebooks - but this library is optimized more for execution from a CI/CD pipeline than from ADF. You can find an example of using this library in the following repository; a Nutter sketch is also shown at the end of this answer.
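
As mentioned in the first bullet, both a simple count assertion and a chispa DataFrame comparison can each be a test. A minimal sketch follows; clean_names is a hypothetical stand-in for one of your own notebook functions.

from chispa import assert_df_equality
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def clean_names(df):
    # Hypothetical function under test - replace with a function
    # defined in (or imported from) your notebook.
    return df.withColumn("name", F.trim(F.lower(F.col("name"))))

def test_clean_names():
    source = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])
    expected = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    # Compares schemas and row contents of the two DataFrames.
    assert_df_equality(clean_names(source), expected)

def test_row_count():
    df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
    # The simple count-based assertion mentioned above.
    assert df.count() == 3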
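
And here is a sketch of what a Nutter test notebook could look like, using the library's run_/assertion_ method-name convention; the path and expected count are placeholders.

from runtime.nutterfixture import NutterFixture

class CountMatchFixture(NutterFixture):
    def run_count_match(self):
        # 'spark' is predefined in a Databricks notebook; the path
        # is a placeholder for your source file.
        self.df = spark.read.option("header", "true").csv("/mnt/raw/customers.csv")

    def assertion_count_match(self):
        # Replace 100 with the expected row count of the source file.
        assert self.df.count() == 100

result = CountMatchFixture().execute_tests()
print(result.to_string())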