SBT run with "provided" works under the '.' project but fails with no mercy under any subproject


I'm working with the latest sbt, sbt.version=1.5.7.

My assembly.sbt contains nothing more than addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0").
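(For completeness, these live under the standard project/ layout in my build, i.e. roughly:)

project/build.properties:
  sbt.version=1.5.7

project/assembly.sbt:
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")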

I have to work with subprojects due to project requirements.

I'm facing the same issue with Spark dependencies in the "provided" scope as described in this post: How to work efficiently with SBT, Spark and "provided" dependencies?

As the above post suggests, I can manage to Compile / run under the root project, but Compile / run fails in the subproject.

Here's my build.sbt detail:

val deps = Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-core" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-avro" % "3.1.2" % "provided",
)

val analyticsFrameless =
  (project in file("."))
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))
    .settings(libraryDependencies ++= deps)

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))
    // .dependsOn(analytics)
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava" % "guava" % "30.1.1-jre",
        "io.delta" %% "delta-core" % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
    )

Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

impressionModelEtl / Compile / run := Defaults
  .runTask(
    impressionModelEtl / Compile / fullClasspath,
    impressionModelEtl / Compile / run / mainClass,
    impressionModelEtl / Compile / run / runner
  )
  .evaluated
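(As I understand the linked post, the point of these runTask overrides is that "provided" dependencies are on the compile classpath but not the runtime one, so run has to be rebuilt from Compile / fullClasspath. A quick sanity check from the sbt shell, assuming the project ids above:)

show impressionModelEtl / Compile / fullClasspath   // should list the spark-* jars
show impressionModelEtl / Runtime / fullClasspath   // should not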

After I execute impressionModelEtl / Compile / run with a simple program:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkRead {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession
        .builder()
        .master("local[*]")
        .appName("SparkReadTestProvidedScope")
        .getOrCreate()
    spark.stop()
  }
}

it returns:

[error] java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
[error]         at SparkRead$.main(SparkRead.scala:7)
[error]         at SparkRead.main(SparkRead.scala)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
[error]         at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)

This has baffled me for days. Please help me out... Thanks so much.

CodePudding user response:

Please try adding dependsOn:

val analyticsFrameless =
  (project in file("."))
    .dependsOn(sqlChoreography, impressionModelEtl)
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )

If you are sharing test classes between projects, also add:

.dependsOn(sqlChoreography % "compile->compile;test->test",
           impressionModelEtl % "compile->compile;test->test")
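(For context, a minimal sketch of my understanding of the difference, with hypothetical project names: .aggregate only forwards tasks such as compile and test to the listed subprojects, while .dependsOn also puts their compiled output on the depending project's classpath.)

lazy val core = (project in file("core"))

lazy val app =
  (project in file("app"))
    .dependsOn(core) // core's classes end up on app's Compile classpath
    .aggregate(core) // "app / compile" also triggers "core / compile"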

CodePudding user response:

Finally I figured out a solution: split the parent project's build.sbt into separate build.sbt files, one per subproject.

For example, in ./build.sbt:

import Dependencies._
ThisBuild / trackInternalDependencies := TrackLevel.TrackIfMissing
ThisBuild / exportJars                := true
ThisBuild / scalaVersion              := "2.12.12"
ThisBuild / version                   := "0.0.1"

ThisBuild / Test / parallelExecution := false
ThisBuild / Test / fork              := true
ThisBuild / Test / javaOptions ++= Seq(
  "-Xms512M",
  "-Xmx2048M",
  "-XX:MaxPermSize=2048M",
  "-XX:+CMSClassUnloadingEnabled"
)

val analyticsFrameless =
  (project in file("."))
    // .dependsOn(sqlChoreography % "compile->compile;test->test", impressionModelEtl % "compile->compile;test->test")
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))

Then, in the impression-model-etl directory, create another build.sbt file:

import Dependencies._

lazy val impressionModelEtl =
  (project in file("."))
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava"            % "guava"         % "30.1.1-jre",
        "io.delta"                   %% "delta-core"    % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
      // , assembly / assemblyExcludedJars := {
      //   val cp = (assembly / fullClasspath).value
      //   cp filter { _.data.getName == "org.apache.spark" }
      // }
    )

Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

assembly / assemblyOption := (assembly / assemblyOption).value.withIncludeBin(false)

assembly / assemblyJarName := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"

name := "impression"
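(With this in place, building the deployable jar and letting the cluster provide Spark at runtime looks roughly like the following; the jar path is only what I'd expect from the assemblyJarName setting above, so it may differ:)

sbt "impressionModelEtl / assembly"
spark-submit \
  --class SparkRead \
  impression-model-etl/target/scala-2.12/impression_2.12-3.1.2_0.0.1.jar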

And be sure to extract the common Spark dependencies into a Dependencies.scala file under the parent build's project/ directory:

import sbt._

object Dependencies {
  // Versions
  lazy val sparkVersion = "3.1.2"

  val deps = Seq(
    "org.apache.spark"       %% "spark-sql"                        % sparkVersion             % "provided",
    "org.apache.spark"       %% "spark-core"                       % sparkVersion             % "provided",
    "org.apache.spark"       %% "spark-mllib"                      % sparkVersion             % "provided",
    "org.apache.spark"       %% "spark-avro"                       % sparkVersion             % "provided",
    ...
  )
}
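(For the import Dependencies._ lines to resolve, Dependencies.scala has to sit in the build root's project/ directory; as I understand it the resulting layout is roughly:)

.
├── build.sbt
├── project/
│   ├── Dependencies.scala
│   ├── assembly.sbt
│   └── build.properties
├── sql-choreography/
└── impression-model-etl/
    └── build.sbt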

With all these steps done, running Spark code locally in the subproject works normally, while the Spark dependencies stay in the "provided" scope.
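(For example, from the build root, the following now runs as expected:)

sbt "impressionModelEtl / Compile / run"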
