I am trying to write some Apache Spark code in Scala, but I am stuck in 'Provider Hell'
import org.apache.spark
import org.apache.spark.ml.fpm.FPGrowth
@main def example1() = {
  println("Example 1")
  val dataset = spark.createDataset(Seq(
    "1 2 5",
    "1 2 3 5",
    "1 2")
  ).map(t => t.split(" ")).toDF("items")
  val fpGrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
  val model = fpGrowth.fit(dataset)
  // Display frequent itemsets.
  model.freqItemsets.show()
  // Display generated association rules.
  model.associationRules.show()
  // transform examines the input items against all the association rules and summarize the
  // consequents as prediction
  model.transform(dataset).show()
}
When I try to compile this, I get
sbt:fp-laboratory> compile
[info] compiling 2 Scala sources to /Users/eric.kolotyluk/git/autonomous-iam/poc/fp-laboratory/target/scala-3.1.2/classes ...
[error] -- [E008] Not Found Error: /Users/eric.kolotyluk/git/autonomous-iam/poc/fp-laboratory/src/main/scala/Example1.scala:7:22
[error] 7 | val dataset = spark.createDataset(Seq(
[error] | ^^^^^^^^^^^^^^^^^^^
[error] | value createDataset is not a member of org.apache.spark
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 3 s, completed May 30, 2022, 2:13:18 PM
but this is not really the problem. IntelliJ complains about line 1 of my code, import org.apache.spark, highlighting import, such that when I mouse over import I get

'/private/var/folders/h0/9w1gfn9j1qvgs5b_q9c16bj40000gp/T/fp-laboratory-fp-laboratory-target' does not exist or is not a directory or .jar file

which means absolutely nothing to me. I have no idea why it's looking for that. However, looking at my build.sbt file:
ThisBuild / organization := "com.forgerock"
ThisBuild / scalaVersion := "3.1.2"
ThisBuild / version := "0.1.0-SNAPSHOT"
lazy val root = (project in file("."))
.settings(
name := "fp-laboratory",
libraryDependencies = Seq(
("org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13),
("org.apache.spark" %% "spark-sql" % "3.2.0" % "provided").cross(CrossVersion.for3Use2_13)
)
)
// include the 'provided' Spark dependency on the classpath for `sbt run`
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
When I remove "provided"
from the build.sbt file, IntelliJ stops complaining about
'/private/var/folders/h0/9w1gfn9j1qvgs5b_q9c16bj40000gp/T/fp-laboratory-fp-laboratory-target' does not exist or is not a directory or .jar file
and only complains about value createDataset is not a member of org.apache.spark
I suspect there is something fishy about provided
scope, as until now, I have zero experience with this.
Consequently, Provider Hell.
My Scala code runs fine when I use

/opt/homebrew/Cellar/apache-spark/3.2.1/bin/spark-shell

so I suspect there is some super secret magic required to run a Scala program under IntelliJ or sbt to tell it where spark-mllib and spark-sql are, but I am no wizard. Can someone please tell me what the provided magic is so I can get out of hell?
CodePudding user response:
Well actually, that's not related to the provided scope. As you mentioned, the code works fine when you use spark-shell, right? That's because spark-shell predefines a value named spark, which is an instance of org.apache.spark.sql.SparkSession and therefore has a createDataset method. In your code in IntelliJ, however, spark refers to org.apache.spark (imported right above), which is a package and has no createDataset method. Do you see where I'm going? Try defining a value of type SparkSession and using it in your code, something like this:
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

@main def example1() = {
  println("Example 1")
  val spark = SparkSession.builder()
    /* do your configurations here, like setting master, ... */
    .getOrCreate()
  import spark.implicits._ // provides the encoders that createDataset and toDF need
  val dataset = spark.createDataset(Seq(
    "1 2 5",
    "1 2 3 5",
    "1 2")
  ).map(t => t.split(" ")).toDF("items")
  // other stuff
}
Now this should work fine.
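To fill in the configuration placeholder above for a plain local run, the builder call inside example1 could look something like the sketch below; the appName and master values are illustrative assumptions, not from the original code.

val spark = SparkSession.builder()
  .appName("fp-laboratory")   // any descriptive name will do
  .master("local[*]")         // run Spark in-process, using all local cores
  .getOrCreate()
import spark.implicits._      // encoders needed by createDataset and toDF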
CodePudding user response:
Okay, it was not Provider Hell after all, although I still don't understand this well. I was not setting up my Spark app properly, because I did not realize that spark-shell automatically does a lot of work...
In my code I needed to explicitly create a SparkSession first...
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import scala.util.Using

def newSparkSession = SparkSession
  .builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()
then
Using(newSparkSession) { sparkSession =>
  import sparkSession.implicits._
  val dataset = sparkSession.createDataset(Seq(
    "1 2 5",
    "1 2 3 5",
    "1 2")
  ).map(t => t.split(" ")).toDF("items")
  val fpGrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
  val model = fpGrowth.fit(dataset)
  // Display frequent itemsets.
  model.freqItemsets.show()
  // Display generated association rules.
  model.associationRules.show()
  // transform examines the input items against all the association rules and summarize the
  // consequents as prediction
  model.transform(dataset).show()
}
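A side note that is not in the original post: Using(newSparkSession) { ... } compiles because SparkSession implements java.io.Closeable, so the session is closed automatically when the block exits, and Using returns a scala.util.Try, so any failure inside the block can be inspected instead of being lost. A minimal sketch of checking that result:

import scala.util.{Failure, Success, Using}

Using(newSparkSession) { sparkSession =>
  // ... the FPGrowth work shown above ...
  "done"
} match {
  case Success(message) => println(message)
  case Failure(error)   => error.printStackTrace()
}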
Also, the correct build.sbt is
lazy val root = (project in file("."))
.settings(
name := "fp-laboratory",
libraryDependencies = Seq(
("org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13),
("org.apache.spark" %% "spark-sql" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13)
),
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
)
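As far as I can tell, "provided" keeps spark-sql and spark-mllib on the compile classpath but off the runtime classpath (which is what you want when spark-submit or a cluster supplies those jars), and the Compile / run := Defaults.runTask(...) line adds them back just for sbt run.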
Now I am onto my next Scala problem...