Unable to find Databricks spark sql avro shaded jars in any public maven repository


We are trying to create Avro records with the Confluent Schema Registry and publish those records to a Kafka cluster.

To attach the schema ID (magic bytes) to each record, we need to use:

to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column
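
For context, here is a minimal sketch of how that overload is typically invoked; the registry address, the subject names passed via lit, and the input DataFrame df are hypothetical placeholders:

import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical registry address and subject names, for illustration only.
val schemaRegistryAddress = "https://schema-registry.example.com:8081"

val avroDf = df.select(
  to_avro(col("key"), lit("t-key"), schemaRegistryAddress).as("key"),
  to_avro(col("value"), lit("t-value"), schemaRegistryAddress).as("value")
)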

To automate this, we need to build the project in a CI pipeline and configure Databricks jobs to use the resulting jar.

The problem we are facing: in notebooks we can see a to_avro method with three parameters, but the same library downloaded for our local build from https://mvnrepository.com/artifact/org.apache.spark/spark-avro_2.12/3.1.2 has only two overloads of to_avro.

Does Databricks have some other Maven repository for its shaded jars?

NOTEBOOK output

import org.apache.spark.sql.avro.functions

println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/databricks/jars/----workspace_spark_3_1--vendor--avro--avro_2.12_deploy_shaded.jar

functions
  .getClass
  .getMethods
  .filter(_.getName == "to_avro")
  .foreach(m => println(m.getName, m.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))

LOCAL output

import org.apache.spark.sql.avro.functions

println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/<home-dir-path>/.gradle/caches/modules-2/files-2.1/org.apache.spark/spark-avro_2.12/3.1.2/1160ae134351328a0ed6a062183faf9a0d5b46ea/spark-avro_2.12-3.1.2.jar

functions
  .getClass
  .getMethods
  .filter(_.getName == "to_avro")
  .foreach(m => println(m.getName, m.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))

Versions

Databricks => 9.1 LTS
Apache Spark => 3.1.2
Scala => 2.12

CodePudding user response:

No, these jars aren't published to any public repository. You could check whether databricks-connect provides these jars (you can get their location with databricks-connect get-jar-dir), but I doubt it does.

Another approach is to mock it: create a small library that declares a function with the required signature, use it only for compilation, and don't include it in the resulting jar.
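
A minimal sketch of such a compile-only stub, assuming the signatures from the notebook reflection output above (the bodies are never executed; on the cluster, the Databricks Runtime's shaded jar supplies the real implementation):

// Compile-time stub of org.apache.spark.sql.avro.functions.
// Build it as a separate module and keep it on the compile classpath only
// (e.g. compileOnly in Gradle, "provided" in sbt), so it is not packaged
// into the resulting jar.
package org.apache.spark.sql.avro

import org.apache.spark.sql.Column

object functions {
  private def stub: Nothing = throw new UnsupportedOperationException(
    "compile-time stub; the real implementation comes from the Databricks Runtime")

  // Signatures copied from the notebook reflection output above.
  def to_avro(data: Column): Column = stub
  def to_avro(data: Column, jsonFormatSchema: String): Column = stub
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column = stub
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String,
              jsonFormatSchema: String): Column = stub
}

Note that this object has the same fully qualified name as the one in open-source spark-avro, so keep the stub ahead of (or instead of) spark-avro on the compile classpath; otherwise the two-overload version will shadow it.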
