Apache Spark 3.1.2 can't read from S3 via documented spark-hadoop-cloud


The Spark documentation at https://spark.apache.org/docs/latest/cloud-integration.html suggests using spark-hadoop-cloud to read from and write to S3.

There is no Apache Spark published artifact for spark-hadoop-cloud. When the Cloudera-published module is used instead, the following exception occurs:


Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)

This seems like a classpath conflict: the NoSuchMethodError indicates that the S3A code expects a newer Guava than the one on the classpath. It therefore seems it is not possible to use spark-hadoop-cloud to read from S3 with the vanilla Apache Spark 3.1.2 jars.
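The missing Preconditions.checkArgument overload exists only in Guava releases newer than the one Spark bundles, so one possible workaround is to shade Guava inside the uber jar. A minimal sketch using sbt-assembly (which the build file below already configures); the shaded package name is arbitrary and this setting is not taken from the question:

// Hypothetical sbt-assembly shade rule: relocate Guava so the copy needed by
// hadoop-aws/S3A does not clash with the older Guava on Spark's classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)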

version := "0.0.1"

scalaVersion := "2.12.12"

lazy val app = (project in file("app")).settings(
    assemblyPackageScala / assembleArtifact := false,
    assembly / assemblyJarName := "uber.jar",
    assembly / mainClass := Some("com.example.Main"),
    // more settings here ...
  )

resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.1.3.1.7270.0-253"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.7.2.7.0-184"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.1.7.2.7.0-184"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"

libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.21.3" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"

import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local")
      //.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
      //.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
      .appName("spark session").getOrCreate

    val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
    val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
    jsonDF.show()
    csvDF.show()
  }
}
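For completeness, the s3a:// reads above also assume the S3A connector can find AWS credentials. A sketch of wiring them explicitly inside main before the reads, assuming they are exported as the standard AWS environment variables (by default the S3A credential provider chain already checks these, so this is only needed when that lookup does not apply):

    // Hypothetical explicit credential wiring; the environment variable names are the AWS defaults.
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))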

CodePudding user response:

To read from and write to S3 with Spark you only need these two dependencies:

"org.apache.hadoop" % "hadoop-aws" % hadoopVersion, 
"org.apache.hadoop" % "hadoop-common" % hadoopVersion

Make sure the hadoopVersion is the same one your worker nodes use, and make sure the worker nodes also have these dependencies available. The rest of your code looks correct.
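As an illustration of that advice, a minimal sketch of the dependency section, assuming a Spark 3.1.2 distribution built against Hadoop 3.2.0 (the Hadoop version here is an assumption and must be checked against the actual cluster):

val hadoopVersion = "3.2.0"  // assumption: match the Hadoop build of the Spark distribution on the workers

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"     % "3.1.2" % "provided",
  "org.apache.hadoop" %  "hadoop-aws"    % hadoopVersion,
  "org.apache.hadoop" %  "hadoop-common" % hadoopVersion
)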
