I am trying to write data on an S3 bucket from my local computer:
spark = SparkSession.builder \
.appName('application') \
.config("spark.hadoop.fs.s3a.access.key", configuration.AWS_ACCESS_KEY_ID) \
.config("spark.hadoop.fs.s3a.secret.key", configuration.AWS_ACCESS_SECRET_KEY) \
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.getOrCreate()
lines = spark.readStream \
.format('kafka') \
.option('kafka.bootstrap.servers', kafka_server) \
.option('subscribe', kafka_topic) \
.option("startingOffsets", "earliest") \
.load()
streaming_query = lines.writeStream \
.format('parquet') \
.outputMode('append') \
.option('path', configuration.S3_PATH) \
.start()
streaming_query.awaitTermination()
Hadoop version: 3.2.1, Spark version 3.2.1
I have added the dependency jars to pyspark jars:
spark-sql-kafka-0-10_2.12:3.2.1, aws-java-sdk-s3:1.11.375, hadoop-aws:3.2.1,
I get the following error when executing:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.start.
: java.io.IOException: From option fs.s3a.aws.credentials.provider
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
CodePudding user response:
In my case, it worked in the end by adding the following statement: .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
Also, all the hadoop jars in site-package/pyspark/jars must be in the same version, hadoop-aws:3.2.2, hadoop-client-api-3.2.2, hadoop-client-runtime-3.2.2, hadoop-yam-server-web-proxy-3.2.2
For version 3.2.2 of hadoop-aws, aws-java-sdk-s3:1.11.563 package is needed
CodePudding user response:
I used same package with you. in my case, when i added below the line.
config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
i got this error.
py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
....
For solving this I installed `guava-30.0