We are trying to create Avro records with the Confluent Schema Registry and publish them to a Kafka cluster.
To attach the schema id to each record (the magic bytes), we need to use:
to_avro(Column data, Column subject, String schemaRegistryAddress)
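For reference, the intended call in a notebook looks roughly like the sketch below. The registry address, broker, topic, and the df DataFrame are placeholders; the subject name follows the usual topic-value naming convention.

import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{lit, struct}

val schemaRegistryAddress = "https://schema-registry:8081"  // placeholder

// Pack all columns into one Avro value column and write it to Kafka;
// df is assumed to be an existing DataFrame.
df.select(
    to_avro(struct(df.columns.map(df(_)): _*), lit("my-topic-value"), schemaRegistryAddress).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("topic", "my-topic")
  .save()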
To automate this, we build the project in a pipeline and configure Databricks jobs to use the resulting jar.
The problem we are facing: in notebooks we can find a to_avro method with 3 parameters, but the same library used in our build, downloaded from https://mvnrepository.com/artifact/org.apache.spark/spark-avro_2.12/3.1.2, has only 2 overloads of to_avro.
Does Databricks have some other Maven repository for its shaded jars?
NOTEBOOK output
import org.apache.spark.sql.avro.functions
println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/databricks/jars/----workspace_spark_3_1--vendor--avro--avro_2.12_deploy_shaded.jar
functions
.getClass()
.getMethods()
  .filter(p => p.getName.equals("to_avro"))
  .foreach(f => println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data, final org.apache.spark.sql.Column subject, final java.lang.String schemaRegistryAddress))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))
LOCAL output
import org.apache.spark.sql.avro.functions
println(functions.getClass().getProtectionDomain().getCodeSource().getLocation())
// file:/<home-dir-path>/.gradle/caches/modules-2/files-2.1/org.apache.spark/spark-avro_2.12/3.1.2/1160ae134351328a0ed6a062183faf9a0d5b46ea/spark-avro_2.12-3.1.2.jar
functions
.getClass()
.getMethods()
  .filter(p => p.getName.equals("to_avro"))
  .foreach(f => println(f.getName, f.getParameters.mkString("Array(", ", ", ")")))
// (to_avro,Array(final org.apache.spark.sql.Column data, final java.lang.String jsonFormatSchema))
// (to_avro,Array(final org.apache.spark.sql.Column data))
Versions
Databricks => 9.1 LTS
Apache Spark => 3.1.2
Scala => 2.12
CodePudding user response:
No, these jars aren't published to any public repository. You may check whether databricks-connect provides them (you can get their location with databricks-connect get-jar-dir), but I doubt it.
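If the jars do turn out to be there, one way to compile against them without publishing anything is to treat them as unmanaged jars. A minimal sbt sketch (the directory is a placeholder for whatever databricks-connect get-jar-dir prints; in Gradle the equivalent would be a compileOnly fileTree dependency):

// build.sbt -- compile-time only; the cluster provides these jars at runtime
val databricksJarDir = file("/path/printed/by/get-jar-dir")  // placeholder
Compile / unmanagedJars ++= (databricksJarDir ** "*.jar").classpath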
Another approach is to mock it: create a small library that declares a function with the matching package and signature, use it for compilation only, and don't include it in the resulting jar.
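A minimal sketch of such a stub, with the package and signatures copied from the notebook reflection output above. The bodies are never executed: at runtime the cluster's shaded jar supplies the real implementation, so the stub is compiled against spark-sql and must be excluded from the assembled jar (compileOnly in Gradle, "provided" scope in sbt).

// Compile-time stub only -- do not ship this in the application jar.
package org.apache.spark.sql.avro

import org.apache.spark.sql.Column

object functions {
  // Databricks-only overloads, signatures taken from the reflection output above
  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String): Column =
    throw new UnsupportedOperationException("stub; provided by the Databricks runtime")

  def to_avro(data: Column, subject: Column, schemaRegistryAddress: String, jsonFormatSchema: String): Column =
    throw new UnsupportedOperationException("stub; provided by the Databricks runtime")

  // Standard overloads redeclared so the open-source spark-avro artifact can be
  // dropped from the compile classpath (otherwise its own functions object clashes)
  def to_avro(data: Column): Column =
    throw new UnsupportedOperationException("stub; provided by the Databricks runtime")

  def to_avro(data: Column, jsonFormatSchema: String): Column =
    throw new UnsupportedOperationException("stub; provided by the Databricks runtime")
}

Compile the job against this stub instead of spark-avro; on the cluster the same calls resolve to the shaded Databricks implementation at runtime.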