How to run Apache Spark applications with a JAR dependency from AWS S3?-CodePudding

I have a .jar file containing useful functions for my application located in an AWS S3 bucket, and I want to use it as a dependency in Spark without having to first download it locally. Is it possible to directly reference the .jar file with spark-submit (or pyspark) --jars option?

So far, I have tried the following:

spark-shell --packages com.amazonaws:aws-java-sdk:1.12.336,org.apache.hadoop:hadoop-aws:3.3.4 --jars s3a://bucket/path/to/jar/file.jar

The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are correctly set, since when running the same command without the --jars option, other files in the same bucket are successfully read. However, if the option is added, I get the following error:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
    at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
    at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:364)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
    ... 27 more

I'm using Spark 3.3.1 pre-built for Apache Hadoop 3.3 and later.

CodePudding user response：

This may be because when in client mode - Spark during its boot distributes the Jars (specified in --jars) via Netty first. To download a remote JAR from a third-party file system (i.e. S3), it'll need the right dependency (i.e. hadoop-aws) in the classpath already (before it prepares the final classpath).

But since it is yet to distribute the JARs it has not prepared the classpath - thus when it tries to download the JAR from s3, it fails with ClassNotFound (as hadoop-aws is yet to be on the classpath), but when doing the same in the application code it succeeds - as by that time the classpath has been resolved.

i.e. Downloading the dependency is dependent on a library that will be loaded later.

CodePudding user response：

To run Apache Spark applications with a JAR dependency from Amazon S3, you can use the --jars command-line option to specify the S3 URL of the JAR file when submitting the Spark application.

For example, if your JAR file is stored in the my-bucket S3 bucket at the jars/my-jar.jar path, you can submit the Spark application as follows:

spark-submit --jars s3a://my-bucket/jars/my-jar.jar \
  --class com.example.MySparkApp \
  s3a://my-bucket/my-spark-app.jar

This will download the JAR file from S3 and add it to the classpath of the Spark application.

Note that you will need to include the s3a:// prefix in the S3 URL to use the s3a filesystem connector, which is the recommended connector for reading from and writing to S3. You may also need to configure the fs.s3a.access.key and fs.s3a.secret.key properties with your AWS access key and secret key in order to authenticate the connection to S3.