I have installed Spark 3.3.1 (built with Scala 2.12.15) on macOS. The Homebrew installation also added OpenJDK 64-Bit Server VM 19.0.1.
I am currently using Python 3.9.
Env variables:
export JAVA_HOME=/usr/local/Cellar/openjdk/19.0.1/libexec/openjdk.jdk/Contents/Home
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.3.1/libexec
export SPARK_LOCAL_DIRS=$HOME/tmp/spark
export PYSPARK_PYTHON=/usr/local/bin/python3.9
Code
...
conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
conf.set('fs.s3a.aws.credentials.provider','org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
#conf.set('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
conf.set('spark.hadoop.fs.s3a.access.key', aws_source["access_key_id"])
conf.set('spark.hadoop.fs.s3a.secret.key', aws_source["secret_access_key"])
conf.set('spark.hadoop.fs.s3a.endpoint', aws_source["host"])
conf.set('s3bucket', aws_source['bucket'])
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
s3folder = f"s3a://{conf.get('s3bucket')}/spark/fashion/sales"
df = spark.read.options(header='true', inferSchema='true').csv(s3folder)
Terminal Output
/usr/local/bin/python3.9 /Users/d051079/Library/CloudStorage/OneDrive-SAPSE/GitHub/sparkcheck/thhspark/connections.py
Warning: Ignoring non-Spark config property: s3bucket
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
:: loading settings :: url = jar:file:/usr/local/Cellar/apache-spark/3.3.1/libexec/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/myuser/.ivy2/cache
The jars for the packages stored in: /Users/muuser/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4f584447-37e3-49b8-880f-01011a577f68;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;3.3.4 in central
found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
:: resolution report :: resolve 172ms :: artifacts dl 9ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
org.apache.hadoop#hadoop-aws;3.3.4 from central in [default]
org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-4f584447-37e3-49b8-880f-01011a577f68
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/10ms)
22/12/16 15:16:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "/Users/myuser/Library/CloudStorage/OneDrive-SAPSE/GitHub/sparkcheck/thhspark/connections.py", line 44, in <module>
df = spark.read.options(header='true', inferSchema='true').csv(s3folder)
File "/usr/local/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/readwriter.py", line 535, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/usr/local/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o43.csv.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2625)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2590)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonClientException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 31 more
I presumed that py4j might be to blame and tried to replace the OpenJDK 19 that Spark uses with openjdk@11 by redirecting JAVA_HOME. Firstly, PySpark ignored the change and kept using JDK 19, and secondly, it did not help.
Because the Hadoop jars in my Spark installation are version 3.3.2, I also tried org.apache.hadoop:hadoop-aws:3.3.2, but to no avail.
The Troubleshooting page was of no help: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html
CodePudding user response:
It seems like you already have the hadoop-aws jar in your classpath, good start! com.amazonaws.AmazonClientException comes from the aws-java-sdk-bundle jar though, so you'll need to add that to your classpath. You can grab the version you want from Maven Repository.
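As a rough sketch of what that could look like in PySpark: the helpers below (`bundled_hadoop_version`, `s3a_settings` and `build_session`) are hypothetical names, not part of any library. The default versions are simply the ones the Ivy log in the question resolved (hadoop-aws 3.3.4 together with aws-java-sdk-bundle 1.12.262), and the `hadoop-client-api-<version>.jar` naming under `$SPARK_HOME/jars` is an assumption about how the Spark distribution lays out its jars. Note that every Hadoop option carries the `spark.hadoop.` prefix, since Spark otherwise drops it with the "Ignoring non-Spark config property" warning seen in the output.

```python
import re
from pathlib import Path

# Assumption: versions copied from the Ivy log in the question; adjust them
# so that hadoop-aws matches the Hadoop version your Spark build ships with.
HADOOP_AWS_VERSION = "3.3.4"
AWS_SDK_BUNDLE_VERSION = "1.12.262"


def bundled_hadoop_version(spark_home):
    """Guess the bundled Hadoop version from jar names in <spark_home>/jars.

    Assumes a jar named hadoop-client-api-<x.y.z>.jar exists; returns None
    when no such jar is found.
    """
    pattern = re.compile(r"hadoop-client-api-(\d+\.\d+\.\d+)\.jar$")
    for jar in sorted(Path(spark_home, "jars").glob("hadoop-client-api-*.jar")):
        match = pattern.search(jar.name)
        if match:
            return match.group(1)
    return None


def s3a_settings(access_key, secret_key, endpoint,
                 hadoop_aws_version=HADOOP_AWS_VERSION,
                 sdk_bundle_version=AWS_SDK_BUNDLE_VERSION):
    """Return Spark config entries for S3A access (hypothetical helper)."""
    packages = [
        f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        f"com.amazonaws:aws-java-sdk-bundle:{sdk_bundle_version}",
    ]
    return {
        # Comma-separated Maven coordinates that Spark resolves at startup.
        "spark.jars.packages": ",".join(packages),
        # Hadoop options need the "spark.hadoop." prefix to reach Hadoop.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.endpoint": endpoint,
    }


def build_session(settings):
    """Create a SparkSession from the settings dict."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example usage, reusing the aws_source dict from the question:
# spark = build_session(s3a_settings(aws_source["access_key_id"],
#                                    aws_source["secret_access_key"],
#                                    aws_source["host"]))
```

These settings only take effect if no SparkSession is already running in the process, because the packages are resolved when the JVM is launched. If `bundled_hadoop_version` reports something other than the hadoop-aws version you request, aligning the two is probably worth trying first.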
Hope this helps!