Moving PySpark project development from the Databricks UI to VSCode using databricks-connect


I am inheriting a huge PySpark project and, instead of using the Databricks UI for development, I would like to use VSCode via databricks-connect. Because of this, I am struggling to determine the best practices for the following:

  • Because the project files were saved as .py in the repos, when I open them in VSCode it does not recognise the Databricks magic commands such as run. So I cannot run any cell that calls another notebook with %run ./PATH/TO-ANOTHER-FILE. Changing the file to .ipynb or changing the call to dbutils.notebook.run would solve the issue, but it would mean changing cells in almost 20 notebooks. Using dbutils also poses the next challenge.

  • Since Databricks creates the Spark session for you behind the scenes, there was no need to call spark = SparkSession.builder.getOrCreate() when coding in the Databricks UI. But when using databricks-connect, you have to manually create a SparkSession that connects to the remote cluster. This means that, to use dbutils, I will have to do the following:

       from pyspark.dbutils import DBUtils
       dbutils = DBUtils(spark)
    

Changing the whole code base to fit my preferred development workflow does not seem justifiable. Any pointers on how I can work around this?
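
For reference, the boilerplate I would need at the top of every entry point would look roughly like this (a sketch, assuming databricks-connect has already been configured with databricks-connect configure):

    from pyspark.sql import SparkSession
    from pyspark.dbutils import DBUtils

    # with databricks-connect configured, this attaches to the remote cluster
    # instead of starting a local Spark instance
    spark = SparkSession.builder.getOrCreate()

    # dbutils has to be constructed explicitly when running outside the Databricks UI
    dbutils = DBUtils(spark)

    # e.g. file-system utilities then run against the remote workspace
    print(dbutils.fs.ls("/databricks-datasets"))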

CodePudding user response:

Just want to mention that Databricks Connect is in maintenance mode and will be replaced with a new solution later this year.

But really, to migrate to VSCode you don't need databricks-connect. There are a few options here:

  • Use the dbx tool for local code development, so you can run unit tests locally and integration tests/jobs on Databricks. dbx includes the dbx init command that can generate a project skeleton with the recommended directory structure plus code skeletons for unit/integration tests, a CI/CD pipeline, etc.

  • Switch to what I call "mixed development" with Databricks Repos - it includes functionality that allows Python files in Repos to be used as normal Python packages, so you can avoid %run and just do normal Python imports (see the sketch after this list). You can also develop locally with Repos by using the dbx sync command, which replicates your local changes to Repos, so you can make changes in VSCode, maybe run unit tests, and then execute the modified code in the notebooks.
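
For example, a file that used to be pulled in with %run ./utils/common (hypothetical path) can become a plain module that you import; a minimal sketch:

    # utils/common.py - a plain Python module inside the Repo (hypothetical file)
    from pyspark.sql import DataFrame

    def clean_columns(df: DataFrame) -> DataFrame:
        # normalise column names instead of relying on %run-shared state
        return df.toDF(*[c.strip().lower() for c in df.columns])

    # entry-point notebook - a normal import replaces `%run ./utils/common`
    from utils.common import clean_columns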

Regarding the use of spark in your code - you can replace those references with SparkSession.getActiveSession() calls, which pull the active Spark session from the environment. In that case you only need to instantiate a session in unit tests (I recommend the pytest-spark package to simplify that), and the rest of the code won't need SparkSession.builder.getOrCreate(), as it will run on Databricks, which instantiates the session for you (if you use notebooks as the entry point). Problems with dbutils are also solvable, as described in this answer.
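
A minimal sketch of that pattern (module and table names are just placeholders; the test relies on the spark_session fixture that pytest-spark provides):

    # orders.py - library code never builds its own session
    from pyspark.sql import SparkSession, DataFrame

    def load_orders(table_name: str) -> DataFrame:
        # picks up whatever session is active: the one Databricks created for
        # the notebook, or the one the test fixture created locally
        spark = SparkSession.getActiveSession()
        return spark.table(table_name)

    # test_orders.py - pytest-spark injects a local `spark_session` fixture
    from orders import load_orders

    def test_load_orders(spark_session):
        spark_session.createDataFrame(
            [(1, "widget")], "id INT, sku STRING"
        ).createOrReplaceTempView("orders")
        assert load_orders("orders").count() == 1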
