We are working in an environment where multiple developers upload jars to a Databricks cluster with the following configuration:
DBR: 7.3 LTS
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
Scala: 2.12.10
Python: 3.7.5
R: R version 3.6.3 (2020-02-29)
Delta Lake: 0.7.0
Build tool: Maven
Below is our typical workflow:
STEP 0:
Build version 1 of the jar (DemoSparkProject-1.0-SNAPSHOT.jar) with the following object:
import org.apache.spark.sql.{DataFrame, SparkSession}

object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    println("EntryObjectOne: This object is from 1.0 SNAPSHOT JAR")

    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
STEP 1:
Uninstall the old jar(s) from the cluster and keep pushing subsequent versions with small changes to the logic. Hence, we push jars with versions 2.0-SNAPSHOT, 3.0-SNAPSHOT, and so on.
At some point in time, we push the same object with the following code in the jar, say DemoSparkProject-4.0-SNAPSHOT.jar:
import org.apache.spark.sql.{DataFrame, SparkSession}

object EntryObjectOne {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BatchApp")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    println("EntryObjectOne: This object is from 4.0 SNAPSHOT JAR")

    val df: DataFrame = Seq(
      (1, "A", "2021-01-01"),
      (2, "B", "2021-02-01"),
      (3, "C", "2021-02-01")
    ).toDF("id", "value", "date")
    df.show(false)
  }
}
When we import this object in the notebook and run the main function, we still see the println output from the old snapshot jar (EntryObjectOne: This object is from 1.0 SNAPSHOT JAR). This forces us to delete dbfs:/FileStore/jars/*, restart the cluster, and push the latest snapshot again to make it work.
In essence, when I run sc.listJars() the active jar on the driver is the latest 4.0-SNAPSHOT jar. Yet we still see the logic from the old snapshot jars even though they are no longer installed on the cluster at runtime.
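One way to see which artifact the driver actually resolved a class from is to ask the JVM directly. The sketch below is our own diagnostic helper (WhichJar is a hypothetical name, not a Spark or Databricks API); calling it on EntryObjectOne from a notebook would reveal whether the bytecode came from the 1.0 or the 4.0 snapshot jar:

```scala
// Hedged diagnostic sketch: report the code source (jar or directory) a
// class was loaded from. Bootstrap classes have no code source, so we
// fall back to a placeholder string in that case.
object WhichJar {
  def locationOf(cls: Class[_]): String =
    Option(cls.getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse("(no code source, e.g. a JDK bootstrap class)")
}
```

In a notebook cell this would be invoked as WhichJar.locationOf(EntryObjectOne.getClass), and the returned path can be compared against the output of sc.listJars().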
Resolutions we tried/implemented:
- We tried using the Maven Shade plugin, but unfortunately Scala does not support it. (details here)
- We delete the old jars from dbfs:/FileStore/jars/*, restart the cluster, and install the new jars regularly. This works, but a better approach would definitely help. (details here)
- Changing the classpath manually and building the jar with a different groupId in Maven also helps. But with many objects and developers working in parallel, it is difficult to keep track of these changes.
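Since the root cause is that every installed jar shares one cluster classpath, a related workaround (an assumption on our part, not a platform-mandated pattern) is to version the entry-point name itself, so that classes from different snapshots can never shadow one another:

```scala
// Hypothetical naming scheme: suffix the object name with the version.
// EntryObjectOneV4 cannot accidentally resolve to stale V1 bytecode,
// because the old jar simply does not contain a class with this name.
object EntryObjectOneV4 {
  def message(): String = "EntryObjectOne: This object is from 4.0 SNAPSHOT JAR"
  def main(args: Array[String]): Unit = println(message())
}
```

This trades the classpath conflict for a renaming chore, so it has the same bookkeeping drawback as the groupId approach when many developers work in parallel.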
Is this the right way of working with multiple jar versions in Databricks? If there is a better way to handle this version-conflict issue in Databricks, it would help us a lot.
CodePudding user response:
You can't do this with libraries packaged as jars: when you install a library, it is put onto the classpath, and it is removed only when you restart the cluster. The documentation says this explicitly:
When you uninstall a library from a cluster, the library is removed only when you restart the cluster. Until you restart the cluster, the status of the uninstalled library appears as Uninstall pending restart.
It's the same issue as with "normal" Java programs; the JVM simply doesn't support reloading classes this way. See, for example, the answers to this question.
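The JVM behavior can be shown concretely: a classloader resolves a class name once and caches the resulting Class object, so re-installing a jar under the same names changes nothing for classes that are already loaded. A minimal illustration (plain Scala, no Spark needed):

```scala
// A classloader resolves a class name once and caches the Class object;
// subsequent lookups return the identical instance. Bytecode from a newly
// installed jar with the same class names is therefore never picked up
// until a fresh classloader is used - i.e. until the cluster restarts.
object ClassCachingDemo {
  def main(args: Array[String]): Unit = {
    val first  = Class.forName("java.util.ArrayList")
    val second = Class.forName("java.util.ArrayList")
    println(first eq second)  // true: the cached Class instance is reused
  }
}
```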
For Python and R it's easier because they support notebook-scoped libraries, where different notebooks can use different versions of the same library.
P.S. If you're doing unit/integration testing, my recommendation would be to execute tests as Databricks jobs - it will be cheaper, and you won't have conflicts between different versions.
CodePudding user response:
In addition to what's mentioned in the docs: when working with notebooks, you can see what has been added on the driver by running this in a notebook cell:
%sh
ls /local_disk0/tmp/ | grep addedFile
This worked for me on Azure Databricks, and it lists all added jars. Maybe force a cleanup with init scripts?
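The same check can be done from a Scala cell instead of %sh. This is a sketch under the assumption (from the shell command above) that the driver stages added files under /local_disk0/tmp with "addedFile" in the name; on a machine where that directory does not exist, the result is simply empty:

```scala
import java.io.File

// List driver-local files under /local_disk0/tmp whose names contain
// "addedFile" - the Scala equivalent of: ls /local_disk0/tmp/ | grep addedFile
object ListAddedFiles {
  def addedFiles(dir: String = "/local_disk0/tmp"): Seq[String] =
    Option(new File(dir).listFiles())      // null when dir is missing
      .getOrElse(Array.empty[File])
      .filter(_.getName.contains("addedFile"))
      .map(_.getAbsolutePath)
      .toSeq
}
```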