GCP BigQuery dependency causes failure in spark read with error of Jackson databind dependency

Time:09-08

I need to process data from a file using Spark and save it to GCP BigQuery, but I'm stuck on an exception that is thrown when the Spark read happens. The sbt build has to include the GCP BigQuery library, since writing to BigQuery is the main requirement.

The exception I get:

Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.12.3 requires Jackson Databind version >= 2.12.0 and < 2.13.0
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:17)
at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:853)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)

The relevant code is below.

SBT file:

ThisBuild / version := "0.1.0"

ThisBuild / scalaVersion := "2.12.12"

lazy val root = (project in file("."))
  .settings(
    name := "spark-code"
  )

lazy val sparkVersion = "3.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.rogach" %% "scallop" % "4.0.2",
  "com.google.cloud" % "google-cloud-pubsub" % "1.120.11",
  "com.google.cloud" % "google-cloud-bigquery" % "2.15.0",
  "com.google.code.gson" % "gson" % "2.8.9",
  "com.crealytics" %% "spark-excel" % "0.14.0"
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x                             => MergeStrategy.first
}

Spark code:

spark.read
      .format("csv")
      .load("mypath")

To solve this, I tried the following, but none of it worked and the exception still persists.

  1. Excluded the Jackson dependencies from the BigQuery dependency in sbt, like this:

    libraryDependencies ++= Seq(
      "com.google.cloud" % "google-cloud-bigquery" % "2.15.0"
        exclude ("com.fasterxml.jackson.core", "jackson-core")
        exclude ("com.fasterxml.jackson.core", "jackson-databind")
        exclude ("com.fasterxml.jackson.core", "jackson-annotations")
    )

  2. Excluded the dependencies and then added them back explicitly in sbt, like this:

    libraryDependencies ++= Seq(
      "com.google.cloud" % "google-cloud-bigquery" % "2.15.0"
        exclude ("com.fasterxml.jackson.core", "jackson-core")
        exclude ("com.fasterxml.jackson.core", "jackson-databind")
        exclude ("com.fasterxml.jackson.core", "jackson-annotations"),
      "com.fasterxml.jackson.core" % "jackson-databind" % "2.12.0",
      "com.fasterxml.jackson.core" % "jackson-core" % "2.12.0"
    )

  3. Tried other versions of the BigQuery dependency, such as 2.14.0, 2.13.0, 2.12.0 and 2.10.0.

  4. Tried a different Scala version, such as 2.12.13.

The strange thing is that if I remove the BigQuery dependency, the Spark code works without any error.

So far nothing has worked, and I'm still not certain about the root cause of the issue. I would really appreciate some quick help here. Feel free to suggest things to try. Thank you in advance!

CodePudding user response:

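I identified the root cause and found a solution.

The issue is that Spark and BigQuery depend on different Jackson versions. As the exception shows, the Jackson Scala module 2.12.3 that comes with Spark 3.2.0 only accepts Jackson Databind 2.12.x, while the BigQuery dependency brings a different Databind version onto the classpath. I can't change the BigQuery version (it is already the latest), so instead I looked for a Spark version that uses the same Jackson as BigQuery.

I upgraded Spark to 3.3.0 and it worked.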
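
As a minimal sketch of that change, the build from the question would only need a different `sparkVersion`; the other dependencies are left out here for brevity and are assumed to stay as they were. The commented-out `dependencyOverrides` block is an untested alternative that is not part of the answer above: it assumes you would rather stay on Spark 3.2.0 and force Jackson back to the 2.12 line, which may not suit the BigQuery client if it relies on newer Jackson APIs.

```scala
// build.sbt (sketch): same project as in the question, only the Spark version changes.
ThisBuild / scalaVersion := "2.12.12"

lazy val sparkVersion = "3.3.0" // was "3.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "com.google.cloud" %  "google-cloud-bigquery" % "2.15.0"
)

// Untested alternative (an assumption, not from the answer above):
// keep Spark 3.2.0 and pin Jackson to the version its Scala module accepts.
// dependencyOverrides ++= Seq(
//   "com.fasterxml.jackson.core" % "jackson-databind"    % "2.12.3",
//   "com.fasterxml.jackson.core" % "jackson-core"        % "2.12.3",
//   "com.fasterxml.jackson.core" % "jackson-annotations" % "2.12.3"
// )
```

After changing the build, running `sbt evicted` should list which Jackson versions were chosen over others, which helps confirm that the Scala module and Databind now agree on the same minor version.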
