I just started learning to use Spark and AWS. I have configured my Spark session as follows:
import os
from pyspark.sql import SparkSession

# Pull in the hadoop-aws package and run Spark locally.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0") \
    .config("spark.master", "local") \
    .config("spark.app.name", "S3app") \
    .getOrCreate()

# Pass the AWS credentials from the environment to the underlying Hadoop configuration.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
I would like to read data from S3 through the s3a interface using PySpark, as follows:
df = spark.read.csv("s3a://some_container/some_csv.csv")
But I keep getting a java.lang.ClassNotFoundException:
22/07/11 00:48:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/07/11 00:48:01 WARN FileSystem: Failed to initialize fileystem s3a://udacity-dend/sparkbyexamples/csv/zipcodes.csv: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
After googling, I think this is because I have not configured Spark properly. How should I fix this?
CodePudding user response:
A lot more goes on under the hood when establishing a connection to an external file system.
You can go through this link for an explanation.
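In short, this particular ClassNotFoundException typically appears when the hadoop-aws package you pull in does not match the Hadoop version bundled with your Spark distribution, so the S3A connector tries to load credential-provider classes that the older JAR does not contain. Below is a minimal sketch of a configuration that avoids the default credentials-provider chain; the hadoop-aws version string "3.3.2" and the explicit SimpleAWSCredentialsProvider are assumptions you should adjust to match the Hadoop version your Spark build actually ships with:

import os
from pyspark.sql import SparkSession

# NOTE: "3.3.2" is an assumed version; it must match the Hadoop libraries
# bundled with your Spark installation, otherwise class mismatches like the
# one in the question will occur.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2") \
    .config("spark.master", "local") \
    .config("spark.app.name", "S3app") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"]) \
    .getOrCreate()

df = spark.read.csv("s3a://some_container/some_csv.csv")

Setting the options through the spark.hadoop.* prefix at builder time also means you do not need to touch spark._jsc.hadoopConfiguration() after the session is created.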