Read data from s3 using local machine - pyspark

Time:12-11

from pyspark.sql import SparkSession
import boto3
import os
import pandas as pd

spark = SparkSession.builder.getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "myaccesskey")
hadoop_conf.set("fs.s3a.secret.key", "mysecretkey")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")

conn = boto3.resource("s3", region_name="us-east-1")

df = spark.read.csv("s3a://mani-test-1206/test/test.csv", header=True)
df.show()

spark.stop()

Running the code above fails with the following error: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found

The Hadoop and AWS jars the program is using:

spark-hadoop-distribution: spark-3.2.0-bin-hadoop3.2

hadoop jars:
hadoop-annotations-3.2.0.jar
hadoop-auth-3.2.0.jar
hadoop-aws-3.2.0.jar
hadoop-client-api-3.3.1.jar
hadoop-client-runtime-3.3.1.jar
hadoop-common-3.2.0.jar
hadoop-hdfs-3.2.0.jar

aws jars:
aws-java-sdk-1.11.624.jar
aws-java-sdk-core-1.11.624.jar
aws-java-sdk-dynamodb-1.11.624.jar
aws-java-sdk-s3-1.11.624.jar

Any help would be highly appreciated. Thanks.

CodePudding user response:

First, you didn't set the instance profile (one type of IAM role) properly on the EC2 instance where you run the code, so it has no permission to access the named S3 bucket.

Second, check whether the Java library is the latest version and supports getting AWS credentials from an instance profile.
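
If the job really does run on an EC2 instance with an instance profile attached (the question describes a local machine, so this is an assumption), S3A can be pointed at the AWS SDK's instance-profile provider instead of static keys. A minimal sketch; use_instance_profile is a hypothetical helper name, and hadoop_conf is the object returned by spark.sparkContext._jsc.hadoopConfiguration() in the question:

```python
# Credentials provider class shipped with the AWS SDK for Java v1 (aws-java-sdk-core).
INSTANCE_PROFILE_PROVIDER = "com.amazonaws.auth.InstanceProfileCredentialsProvider"


def use_instance_profile(hadoop_conf):
    """Tell S3A to fetch credentials from the EC2 instance profile.

    hadoop_conf is any object with a set(key, value) method, such as the
    Hadoop configuration obtained from the Spark context.
    """
    hadoop_conf.set("fs.s3a.aws.credentials.provider", INSTANCE_PROFILE_PROVIDER)
```

With this provider, the fs.s3a.access.key and fs.s3a.secret.key settings are not needed. On a local machine there is no instance profile to read from, so this approach would not apply there.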

CodePudding user response:

I had the same problem. What helped me:

  • update hadoop-aws from 3.2.0 to 3.2.2
  • use "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (it looks like the class was renamed)
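
The provider change from this answer can be sketched against the question's code as follows. This is an illustration under assumptions, not a verified fix: the helper names (s3a_settings, apply_settings, read_csv_from_s3) are made up for the example, the key values are placeholders, and the sketch also sets fs.s3a.impl rather than the question's fs.s3.impl because the path uses the s3a:// scheme. The hadoop-aws upgrade happens outside the code, by swapping the jar.

```python
def s3a_settings(access_key, secret_key, endpoint="s3.amazonaws.com"):
    """Return the S3A options from the question as a dict, with the provider fixed."""
    return {
        # The path is s3a://..., so the matching implementation key is fs.s3a.impl.
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": endpoint,
        # SimpleAWSCredentialsProvider reads the access/secret key set above.
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        "fs.s3a.connection.ssl.enabled": "true",
    }


def apply_settings(hadoop_conf, settings):
    """Copy every option onto an object exposing set(key, value)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


def read_csv_from_s3(path, access_key, secret_key):
    """Configure S3A on the active Spark session and read a CSV with a header row."""
    # Imported lazily so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    apply_settings(hadoop_conf, s3a_settings(access_key, secret_key))
    return spark.read.csv(path, header=True)
```

Usage would look like read_csv_from_s3("s3a://mani-test-1206/test/test.csv", "myaccesskey", "mysecretkey").show(). Note that the jar list in the question mixes hadoop-aws 3.2.0 with hadoop-client 3.3.1 jars; hadoop-aws is generally expected to match the Hadoop version it runs against, so the version to upgrade to may depend on the Spark distribution in use.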