Read csv and parquet files from S3 using Pyspark

Time:11-01

The requirement is to load csv and parquet files from S3 into a dataframe using PySpark.

The code I'm using is:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf()
appName = "S3"
master = "local"

conf.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
conf.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
sc = SparkContext.getOrCreate(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', aws_access_key_id)
hadoopConf.set('fs.s3a.secret.key', aws_secret_access_key)
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
spark = SparkSession(sc)

df = spark.read.csv('s3://s3path/File.csv')

It gives me this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException

I get a similar error when reading Parquet files:

py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException

How to resolve this?

CodePudding user response:

  1. If something is looking for jets3t, you are using a long-outdated Hadoop release, one old enough that it still supports s3:// URLs.
  2. Upgrade to a version of Spark built with hadoop-3.3.4 or later binaries (whatever is the latest release at the time of reading).
  3. Include the exact aws-sdk-bundle jar that the hadoop-aws jar was built against.
  4. Remove the hadoopConf.set('fs.s3a.impl', ... line; it is a superstition passed down through Stack Overflow posts. Neither the Spark nor the Hadoop documentation examples use it, and the authors there knew what they were writing about.
  5. Then use s3a:// URLs.
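The steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive setup: the helper names to_s3a and build_session are hypothetical, the default version 3.3.4 is only a placeholder and has to match the Hadoop version your Spark build ships with, and pulling hadoop-aws through spark.jars.packages (which also resolves its AWS SDK bundle dependency) assumes the machine can reach a Maven repository.

```python
def to_s3a(url: str) -> str:
    """Rewrite a legacy s3:// or s3n:// URL to the s3a:// scheme."""
    for legacy in ("s3://", "s3n://"):
        if url.startswith(legacy):
            return "s3a://" + url[len(legacy):]
    return url


def build_session(access_key: str, secret_key: str, hadoop_version: str = "3.3.4"):
    """Create a SparkSession that can read s3a:// paths.

    hadoop_version is an assumption: set it to the Hadoop version
    bundled with your Spark distribution.
    """
    # Imported lazily so to_s3a() can be used without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("S3")
        .master("local")
        # Fetches hadoop-aws and, transitively, the AWS SDK bundle it was built against.
        .config("spark.jars.packages", f"org.apache.hadoop:hadoop-aws:{hadoop_version}")
        # The spark.hadoop. prefix copies these into the Hadoop configuration.
        .config("spark.hadoop.fs.s3a.access.key", access_key)
        .config("spark.hadoop.fs.s3a.secret.key", secret_key)
        # Note: fs.s3a.impl is deliberately not set.
        .getOrCreate()
    )


# Usage (needs pyspark, network access, and valid credentials):
# spark = build_session(aws_access_key_id, aws_secret_access_key)
# df = spark.read.csv(to_s3a("s3://s3path/File.csv"))
# pq = spark.read.parquet(to_s3a("s3://s3path/File.parquet"))
```

spark.jars.packages only takes effect if it is set before the JVM starts, so this has to run in a fresh Python process rather than after a SparkContext already exists.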

CodePudding user response:

You are missing a Hadoop client dependency - Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException

Note - Support for s3:// URLs was removed in 2016. As stevel noted, you should switch to the current s3a connector.

You need to make sure the additional dependent libraries are present before you attempt to read data sources from S3.

You can use this answer as a reference to set up your environment accordingly - java.io.IOException: No FileSystem for scheme: s3
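One way to check which dependent library version you need is to ask a running Spark session which Hadoop version it bundles, then build the matching hadoop-aws Maven coordinate from it. The sketch below is only an illustration: both function names are hypothetical, and _jvm is a private PySpark attribute (in the same spirit as the _jsc used in the question), so it may change between releases.

```python
def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Return the Maven coordinate of the hadoop-aws jar for a Hadoop version."""
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


def bundled_hadoop_version(spark) -> str:
    """Ask the JVM behind a SparkSession which Hadoop version it was built with."""
    # VersionInfo is part of hadoop-common, which every Spark build includes.
    return spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()


# Usage (needs pyspark installed):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local").getOrCreate()
# print(hadoop_aws_coordinate(bundled_hadoop_version(spark)))
```

The printed coordinate can then be passed when the real job is launched, for example through spark-submit --packages, so that hadoop-aws and the Hadoop client libraries stay on the same version.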
