Spark ignoring package jars included in the configuration of my Spark Session

I keep running into this error: java.lang.ClassNotFoundException: Failed to find data source: iceberg. Please find packages at https://spark.apache.org/third-party-projects.html.

I am trying to include the org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0 package as part of my Spark code, because I want to write unit tests locally. I have tried several things:

  1. Include the package as part of my SparkSession builder:
   val conf = new SparkConf()
   conf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0")

   val sparkSession: SparkSession =
     SparkSession
       .builder()
       .appName(getClass.getSimpleName)
       .config(conf = conf)
       // ... the rest of my config
       .master("local[*]")
       .getOrCreate()

It does not work; I get the same error. I also tried passing the configuration string directly in the SparkSession builder, and that didn't work either.

  2. Downloading the jar myself. I really don't want to do this; I want it to be automated. But even then, when I set "spark.jars" to point to the downloaded jar, Spark cannot find it for some reason (a rough sketch of this attempt is below).
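Roughly, the second attempt looked like this (the jar path is just a placeholder for wherever the jar was downloaded):

   import org.apache.spark.SparkConf
   import org.apache.spark.sql.SparkSession

   // Attempt 2: point Spark at a locally downloaded jar instead of using spark.jars.packages
   val conf = new SparkConf()
   conf.set("spark.jars", "/path/to/iceberg-spark-runtime-3.2_2.12-1.1.0.jar") // placeholder path

   val sparkSession = SparkSession
      .builder()
      .appName(getClass.getSimpleName)
      .config(conf = conf)
      .master("local[*]")
      .getOrCreate()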

Can anybody help me figure this out?

CodePudding user response:

You can create an uber/fat jar and put all your dependencies in that jar.

Let's say you want to use Iceberg in your Spark application.

Create a pom.xml file and add the dependency in the dependencies section:

<dependencies>
    <dependency>
      <groupId>org.apache.iceberg</groupId>
      <artifactId>iceberg-spark-runtime-3.2_2.12</artifactId>
      <version>1.1.0</version>
    </dependency>
</dependencies>

Together with a fat-jar plugin such as the maven-shade-plugin (or maven-assembly-plugin), the build will create a fat jar with that dependency baked into it. You can deploy that jar via spark-submit and the dependent libraries will be picked up automatically.

CodePudding user response:

It seems spark.jars.packages is only read when spark-shell (or the application) starts up. It can still be changed inside the session via SparkSession or SparkConf; however, the change will not be processed and the package will not be loaded.

For a self-contained Scala application, you would instead add the dependencies to build.sbt, for example:

libraryDependencies ++= Seq(
  "org.mongodb.spark" %% "mongo-spark-connector" % "10.0.5",
  "org.apache.spark" %% "spark-core" % "3.0.2",
  "org.apache.spark" %% "spark-sql" % "3.0.2"
)
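
For the Iceberg case from the question, the same approach means putting the runtime jar on the classpath via the build tool instead of relying on spark.jars.packages. A minimal sketch, assuming Scala 2.12, Spark 3.2 and Iceberg 1.1.0 as in the question; the catalog name "local" and the warehouse path are made up for illustration:

// build.sbt: pull the Iceberg Spark runtime onto the classpath
libraryDependencies += "org.apache.iceberg" % "iceberg-spark-runtime-3.2_2.12" % "1.1.0"

// In the (test) code: enable the Iceberg extensions and register a local Hadoop catalog
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("iceberg-local-test")
  .master("local[*]")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "target/iceberg-warehouse") // illustrative path
  .getOrCreate()

With the dependency resolved by the build tool, the "Failed to find data source: iceberg" error should go away, because the data source is found on the classpath rather than downloaded when the session starts.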