I am trying to write some Apache Spark code in Scala, but I am stuck in 'Provider Hell'
import org.apache.spark
import org.apache.spark.ml.fpm.FPGrowth
@main def example1() = {
  println("Example 1")
  val dataset = spark.createDataset(Seq(
    "1 2 5",
    "1 2 3 5",
    "1 2")
  ).map(t => t.split(" ")).toDF("items")
  val fpGrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
  val model = fpGrowth.fit(dataset)
  // Display frequent itemsets.
  model.freqItemsets.show()
  // Display generated association rules.
  model.associationRules.show()
  // transform examines the input items against all the association rules and summarize the
  // consequents as prediction
  model.transform(dataset).show()
}
When I try to compile this, I get
sbt:fp-laboratory> compile
[info] compiling 2 Scala sources to /Users/eric.kolotyluk/git/autonomous-iam/poc/fp-laboratory/target/scala-3.1.2/classes ...
[error] -- [E008] Not Found Error: /Users/eric.kolotyluk/git/autonomous-iam/poc/fp-laboratory/src/main/scala/Example1.scala:7:22
[error] 7 | val dataset = spark.createDataset(Seq(
[error] | ^^^^^^^^^^^^^^^^^^^
[error] | value createDataset is not a member of org.apache.spark
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 3 s, completed May 30, 2022, 2:13:18 PM
but this is not really the problem. IntelliJ complains about line 1 of my code, import org.apache.spark, highlighting import, such that when I mouse over import I get

'/private/var/folders/h0/9w1gfn9j1qvgs5b_q9c16bj40000gp/T/fp-laboratory-fp-laboratory-target' does not exist or is not a directory or .jar file

which means absolutely nothing to me. I have no idea why it's looking for that. However, looking at my build.sbt file:
ThisBuild / organization := "com.forgerock"
ThisBuild / scalaVersion := "3.1.2"
ThisBuild / version := "0.1.0-SNAPSHOT"
lazy val root = (project in file("."))
.settings(
name := "fp-laboratory",
libraryDependencies = Seq(
("org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13),
("org.apache.spark" %% "spark-sql" % "3.2.0" % "provided").cross(CrossVersion.for3Use2_13)
)
)
// include the 'provided' Spark dependency on the classpath for `sbt run`
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
When I remove "provided"
from the build.sbt file, IntelliJ stops complaining about
'/private/var/folders/h0/9w1gfn9j1qvgs5b_q9c16bj40000gp/T/fp-laboratory-fp-laboratory-target' does not exist or is not a directory or .jar file
and only complains about value createDataset is not a member of org.apache.spark
I suspect there is something fishy about provided
scope, as until now, I have zero experience with this.
Consequently, Provider Hell.
My Scala code runs fine when I use

/opt/homebrew/Cellar/apache-spark/3.2.1/bin/spark-shell

so I suspect there is some super secret magic required to run a Scala program under IntelliJ or sbt to tell it where spark-mllib and spark-sql are, but I am no wizard. Can someone please tell me what the provided magic is so I can get out of hell?
CodePudding user response:
Well actually, that's not related to the provided scope. As you mentioned, the code works fine when you use spark-shell, right? That's because spark-shell predefines a value named spark, which is an instance of org.apache.spark.sql.SparkSession and therefore has a createDataset method. In your code in IntelliJ, however, spark refers to org.apache.spark (imported right above), which is a package and has no createDataset method. Do you see where I'm going? Try defining a value of type SparkSession and using it in your code, something like this:
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

@main def example1() = {
  println("Example 1")
  val spark = SparkSession.builder()
    /* do your configurations here, like setting master, ... */
    .getOrCreate()
  import spark.implicits._ // provides the encoders that createDataset and toDF need
  val dataset = spark.createDataset(Seq(
    "1 2 5",
    "1 2 3 5",
    "1 2")
  ).map(t => t.split(" ")).toDF("items")
  // other stuff
}
Now this should work fine.
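To fill in the configuration placeholder above for a plain local run, the builder call inside example1 could look something like the sketch below; the appName and master values are illustrative assumptions, not from the original code.

val spark = SparkSession.builder()
  .appName("fp-laboratory")   // any descriptive name will do
  .master("local[*]")         // run Spark in-process, using all local cores
  .getOrCreate()
import spark.implicits._      // encoders needed by createDataset and toDF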
CodePudding user response:
Okay, it was not Provider Hell after all, although I still don't understand this well. I was not setting up my Spark app properly, because I did not realize that spark-shell automatically does a lot of work...
In my code I needed to explicitly create a SparkSession first...
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import scala.util.Using

def newSparkSession = SparkSession
  .builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()
then
Using(newSparkSession) { sparkSession =>
  import sparkSession.implicits._
  val dataset = sparkSession.createDataset(Seq(
    "1 2 5",
    "1 2 3 5",
    "1 2")
  ).map(t => t.split(" ")).toDF("items")
  val fpGrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
  val model = fpGrowth.fit(dataset)
  // Display frequent itemsets.
  model.freqItemsets.show()
  // Display generated association rules.
  model.associationRules.show()
  // transform examines the input items against all the association rules and summarize the
  // consequents as prediction
  model.transform(dataset).show()
}
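A side note that is not in the original post: Using(newSparkSession) { ... } compiles because SparkSession implements java.io.Closeable, so the session is closed automatically when the block exits, and Using returns a scala.util.Try, so any failure inside the block can be inspected instead of being lost. A minimal sketch of checking that result:

import scala.util.{Failure, Success, Using}

Using(newSparkSession) { sparkSession =>
  // ... the FPGrowth work shown above ...
  "done"
} match {
  case Success(message) => println(message)
  case Failure(error)   => error.printStackTrace()
}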
Also, the correct build.sbt is
lazy val root = (project in file("."))
.settings(
name := "fp-laboratory",
libraryDependencies = Seq(
("org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13),
("org.apache.spark" %% "spark-sql" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13)
),
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
)
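As far as I can tell, "provided" keeps spark-sql and spark-mllib on the compile classpath but off the runtime classpath (which is what you want when spark-submit or a cluster supplies those jars), and the Compile / run := Defaults.runTask(...) line adds them back just for sbt run.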
Now I am onto my next Scala problem...