The requirement is to load CSV and Parquet files from S3 into a DataFrame using PySpark.
The code I'm using is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf()
appName = "S3"
master = "local"
conf.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
conf.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
sc = SparkContext.getOrCreate(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', aws_access_key_id)
hadoopConf.set('fs.s3a.secret.key', aws_secret_access_key)
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
spark = SparkSession(sc)
df = spark.read.csv('s3://s3path/File.csv')
It gives me this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
And a similar error when reading Parquet files:
py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
How can I resolve this?
CodePudding user response:
- If something is looking for jets3t, you are running a historically out-of-date Hadoop release, one old enough that it still supports s3:// URLs.
- Upgrade to a version of Spark with hadoop-3.3.4 or later binaries (whatever is the latest release at the time of reading).
- Include the exact same aws-java-sdk-bundle JAR that the hadoop-aws JAR depends on in its build.
- Remove the
hadoopConf.set('fs.s3a.impl', ...)
line, as that is a superstition passed down through Stack Overflow posts. Note that neither the Spark nor the Hadoop documentation examples use it, and consider that the authors there knew of what they wrote.
- Then use s3a:// URLs.
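Putting those points together, a minimal sketch of the corrected reader might look like the following. The bucket path and the credential variables are placeholders from the question, the header option is an assumption, and this only works once hadoop-aws and its matching AWS SDK bundle are on the classpath:

```python
from pyspark.sql import SparkSession

# hadoop-aws and the matching aws-java-sdk-bundle must already be on the
# classpath for the s3a:// scheme to resolve.
spark = (
    SparkSession.builder
    .appName("S3")
    .master("local[*]")
    # Placeholder credentials; in production prefer the default AWS
    # credential provider chain over hard-coded keys.
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id)
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key)
    # Note: no fs.s3a.impl setting -- it is not needed.
    .getOrCreate()
)

# s3a:// instead of s3:// -- the old s3:// connector is what pulled in jets3t.
df_csv = spark.read.csv("s3a://s3path/File.csv", header=True)
df_parquet = spark.read.parquet("s3a://s3path/File.parquet")
```

This is a configuration sketch, not a drop-in fix: it assumes your Spark distribution was built against a recent Hadoop 3.x.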
CodePudding user response:
You are missing a Hadoop client dependency, as shown by: Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
Note: support for the AWS s3:// scheme was removed in 2016; as noted by stevel, you should opt for the latest s3a connector,
which you can set up using the link below.
You need to ensure the additional dependent libraries are present before you attempt to read data sources from S3.
You can use this answer as a reference (java.io.IOException: No FileSystem for scheme: s3) to set up your environment accordingly.
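One way to pull in the missing client libraries is to let Spark resolve them from Maven at startup via spark.jars.packages. This is a sketch, and the version numbers below are examples: hadoop-aws must match the Hadoop version your Spark build ships with, and the SDK bundle version must match what that hadoop-aws release was built against.

```python
from pyspark.sql import SparkSession

# spark.jars.packages downloads the listed artifacts (and their transitive
# dependencies) from Maven Central when the session starts. hadoop-aws 3.3.4
# pairs with aws-java-sdk-bundle 1.12.262; adjust both to your Hadoop version.
spark = (
    SparkSession.builder
    .appName("S3")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,"
        "com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    .getOrCreate()
)
```

The same coordinates can be passed on the command line with spark-submit --packages if you prefer not to bake them into the session builder.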