How to set up PySpark to locally read data from S3 using Hadoop?


I followed this blog post which suggests using:

from pyspark import SparkConf
from pyspark.sql import SparkSession
 
conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', <access_key>)
conf.set('spark.hadoop.fs.s3a.secret.key', <secret_key>)
conf.set('spark.hadoop.fs.s3a.session.token', <token>)
 
spark = SparkSession.builder.config(conf=conf).getOrCreate()

I used it to configure PySpark, and it worked: I could read data from S3 directly from my local machine.
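
For illustration, a minimal read along those lines looks like this (the bucket and object path are just placeholders):

# Read a CSV object from S3 through the S3A connector configured above.
df = spark.read.csv('s3a://my-bucket/path/to/data.csv', header=True)
df.show(5)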


However, I found this question about the use of s3a, s3n, or s3, and one of the recent answers advises against using s3a. I also found this guide from AWS discouraging the use of s3a:

Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.


So I looked for how to use the s3 scheme with PySpark and Hadoop, but I found this guide from Hadoop mentioning that only s3a is officially supported:

There are other Hadoop connectors to S3. Only S3A is actively maintained by the Hadoop project itself.


The method from the blog post works, but is it the best option for this situation? Is there any other way to configure this?

What would be the best method to access S3 from a local machine?

CodePudding user response:

Those AWS docs are about EMR. Your local system is not EMR, so ignore them completely.

Use the ASF-developed s3a connector and follow the Hadoop docs on how to use it, in preference to examples from out-of-date Stack Overflow posts (i.e. if the docs say something that contradicts a four-year-old post, go with the docs, or even the source).
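
As a rough sketch (not lifted from the docs verbatim): you can keep credentials out of the code entirely and let S3A resolve them from the environment. The hadoop-aws version, bucket, and path below are placeholders; match the version to the Hadoop build that ships with your Spark.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Pull in the S3A connector; the hadoop-aws version should match the
# Hadoop version bundled with your Spark build (3.2.0 here is only an example).
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
# No keys are hard-coded: by default S3A also checks the standard
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# 's3a://' is the scheme handled by the S3A connector; bucket and path are placeholders.
df = spark.read.parquet('s3a://my-bucket/some/prefix/')
df.printSchema()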
