How to configure PySpark to access AWS S3 buckets?


I just started learning Spark and AWS. I have configured my Spark session as follows:

import os

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0") \
    .config("spark.master", "local") \
    .config("spark.app.name", "S3app") \
    .getOrCreate()

# Pass the AWS credentials from the environment to the S3A connector
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

I would like to read data through the S3A interface using PySpark as follows:

df = spark.read.csv("s3a://some_container/some_csv.csv")

But I keep getting a java.lang.ClassNotFoundException:

22/07/11 00:48:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/07/11 00:48:01 WARN FileSystem: Failed to initialize fileystem s3a://udacity-dend/sparkbyexamples/csv/zipcodes.csv: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found

After googling, I think this is because I have not configured Spark properly. How should I fix this?

CodePudding user response:

A lot more goes on under the hood when establishing a connection to an external file system.

You can go through this link for an explanation of the details.
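As a rough sketch of what that usually means in practice (this is an assumption, not part of the original answer): a ClassNotFoundException like the one above typically appears when the org.apache.hadoop:hadoop-aws package version does not match the Hadoop libraries bundled with the Spark installation. The version string "3.3.2" and the explicit credentials provider below are assumptions; check the hadoop-*.jar files under $SPARK_HOME/jars for the version your build actually ships.

import os

from pyspark.sql import SparkSession

# Assumption: hadoop-aws must match the Hadoop version bundled with Spark
# (for example, a Spark build that ships Hadoop 3.3.2 jars).
hadoop_aws_version = "3.3.2"

spark = SparkSession.builder \
    .config("spark.jars.packages", f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}") \
    .config("spark.master", "local") \
    .config("spark.app.name", "S3app") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"]) \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()

df = spark.read.csv("s3a://some_container/some_csv.csv")

Pinning fs.s3a.aws.credentials.provider to SimpleAWSCredentialsProvider sidesteps the default provider chain that references IAMInstanceCredentialsProvider, the class reported as missing in the log above.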
