I'm working with the latest sbt.version=1.5.7.
My assembly.sbt is nothing more than addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0").
I have to work with subprojects due to project requirements.
I'm dealing with Spark dependencies in the "provided" scope, much like this post: How to work efficiently with SBT, Spark and "provided" dependencies?
As that post suggests, I can get Compile / run to work under the root project, but it fails when I run Compile / run in the subproject.
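For context, this is the shape of the declarations involved; a jar in the "provided" scope is on the compile classpath but left off the default run classpath and the assembly jar, which is exactly the trade-off this question is about. A minimal sketch (same artifact and version as in my build below):

// compiles fine, but at run time the jar has to come from somewhere else
// (e.g. the Spark installation on the cluster)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided"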
Here's my build.sbt in detail:
val deps = Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-core" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-avro" % "3.1.2" % "provided",
)

val analyticsFrameless =
  (project in file("."))
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))
    .settings(libraryDependencies ++= deps)

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))
    // .dependsOn(analytics)
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava" % "guava" % "30.1.1-jre",
        "io.delta" %% "delta-core" % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
    )
// redefine run to use the Compile classpath, which still contains the
// "provided" dependencies (the default run task uses the Runtime classpath)
Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

impressionModelEtl / Compile / run := Defaults
  .runTask(
    impressionModelEtl / Compile / fullClasspath,
    impressionModelEtl / Compile / run / mainClass,
    impressionModelEtl / Compile / run / runner
  )
  .evaluated
After I execute impressionModelEtl / Compile / run
with a simple program:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkRead {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession
        .builder()
        .master("local[*]")
        .appName("SparkReadTestProvidedScope")
        .getOrCreate()

    spark.stop()
  }
}
it returns:
[error] java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
[error] at SparkRead$.main(SparkRead.scala:7)
[error] at SparkRead.main(SparkRead.scala)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
[error] at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
This has baffled me for days. Please help me out... Thanks so much.
CodePudding user response:
Please try adding dependsOn:
val analyticsFrameless =
  (project in file("."))
    .dependsOn(sqlChoreography, impressionModelEtl)
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )
If you are using shared test classes, also add:
.dependsOn(sqlChoreography % "compile->compile;test->test",
           impressionModelEtl % "compile->compile;test->test")
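The distinction matters here: aggregate only forwards commands such as compile and test to the subprojects, while dependsOn is what actually puts their classes on the root project's classpath. A commented restatement of the snippet above, using the same project names (a sketch, not a drop-in replacement):

lazy val analyticsFrameless =
  (project in file("."))
    // classpath dependency: the subprojects' compiled classes (and, with
    // test->test, their test classes) become visible to the root project
    .dependsOn(sqlChoreography % "compile->compile;test->test",
               impressionModelEtl % "compile->compile;test->test")
    // command aggregation only: compile/test on the root also run on the
    // subprojects, but nothing is added to the root's classpath
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(libraryDependencies ++= deps)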
CodePudding user response:
Finally I figured out a solution: split the parent project's build.sbt into per-subproject build.sbt files.
Like in ./build.sbt:
import Dependencies._

ThisBuild / trackInternalDependencies := TrackLevel.TrackIfMissing
ThisBuild / exportJars := true
ThisBuild / scalaVersion := "2.12.12"
ThisBuild / version := "0.0.1"

ThisBuild / Test / parallelExecution := false
ThisBuild / Test / fork := true
ThisBuild / Test / javaOptions ++= Seq(
  "-Xms512M",
  "-Xmx2048M",
  "-XX:MaxPermSize=2048M",
  "-XX:+CMSClassUnloadingEnabled"
)

val analyticsFrameless =
  (project in file("."))
    // .dependsOn(sqlChoreography % "compile->compile;test->test", impressionModelEtl % "compile->compile;test->test")
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))
Then, in the impression-model-etl directory, create another build.sbt file:
import Dependencies._

lazy val impressionModelEtl =
  (project in file("."))
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava" % "guava" % "30.1.1-jre",
        "io.delta" %% "delta-core" % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
      // , assembly / assemblyExcludedJars := {
      //     val cp = (assembly / fullClasspath).value
      //     cp filter { _.data.getName == "org.apache.spark" }
      //   }
    )

Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

assembly / assemblyOption := (assembly / assemblyOption).value.withIncludeBin(false)
assembly / assemblyJarName := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"

name := "impression"
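With this in place (plus the shared Dependencies.scala described next), a typical session from the build root looks roughly like this; Compile / run and assembly are the standard sbt and sbt-assembly tasks:

sbt> impressionModelEtl / Compile / run     (Spark jars are back on the run classpath)
sbt> impressionModelEtl / assembly          (fat jar that still excludes the provided Spark jars)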
And be sure to extract the common Spark library definitions into the parent build's project/ directory, in a Dependencies.scala file (so every build.sbt can import them):
import sbt._

object Dependencies {
  // Versions
  lazy val sparkVersion = "3.1.2"

  val deps = Seq(
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-avro" % sparkVersion % "provided",
    ...
  )
}
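To recap, the resulting layout looks roughly like this (the exact locations of assembly.sbt, build.properties and the source file are just the standard sbt layout, not something spelled out above):

./build.sbt
./project/build.properties          (sbt.version=1.5.7)
./project/assembly.sbt              (the sbt-assembly plugin line)
./project/Dependencies.scala        (shared "provided" Spark deps)
./sql-choreography/build.sbt
./impression-model-etl/build.sbt
./impression-model-etl/src/main/scala/SparkRead.scala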
With all of these steps done, Spark code runs locally in the subproject as normal, while the Spark dependencies stay in the "provided" scope.