Custom Python module in Azure Databricks with spark/dbutils dependencies


I recently switched on the preview feature "Files in Repos" on Azure Databricks so that I could move many of my general functions from notebooks to modules and get rid of the overhead of running a lot of notebooks for a single job.

However, several of my functions rely directly on dbutils or spark/pyspark functions (e.g. dbutils.secrets.get() and spark.conf.set()). Since these are imported in the background for notebooks and are tied directly to the underlying session, I am at a complete loss as to how I can reference them in my custom modules.

For my small sample module, I fixed it by making dbutils a parameter, like in the following example:

class Connection:
    def __init__(self, dbutils):
        token = dbutils.secrets.get(scope="my-scope", key="mykey")
        ...
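
In a notebook, where dbutils already exists as a global, the class is then instantiated like this:

conn = Connection(dbutils)  # dbutils is injected into the notebook's globals by Databricks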

However, doing this for all the existing functions would require significant rewriting of both the functions and the lines that call them. How can I avoid this and do it in a cleaner manner?

CodePudding user response:

The documentation for Databricks Connect shows an example of how this can be achieved. That example takes the SparkSession as an explicit parameter, but it can be modified to avoid that completely, with something like this:

def get_dbutils():
  from pyspark.sql import SparkSession
  spark = SparkSession.getActiveSession()
  # When running via Databricks Connect, dbutils is exposed through pyspark.dbutils;
  # inside a Databricks notebook it already lives in the IPython user namespace.
  if spark.conf.get("spark.databricks.service.client.enabled", "false") == "true":
    from pyspark.dbutils import DBUtils
    return DBUtils(spark)
  else:
    import IPython
    return IPython.get_ipython().user_ns["dbutils"]

and then in your function you can do something like this:

def get_my_secret(scope, key):
  return get_dbutils().secrets.get(scope, key)
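
The same trick covers the spark dependency mentioned in the question. As a rough sketch (assuming PySpark 3.0+ for SparkSession.getActiveSession, and hypothetical helper names), your module can look up the active session itself instead of receiving it as a parameter:

def get_spark():
  # Reuse the SparkSession that the Databricks runtime (or Databricks Connect)
  # has already created for the current execution context.
  from pyspark.sql import SparkSession
  return SparkSession.getActiveSession()

def set_my_conf(key, value):
  # Example module-level function that needs spark without taking it as a parameter.
  get_spark().conf.set(key, value)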