I am creating a Spark session using the snippet below in a Python notebook on an AWS EMR cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()
Then I read data from an S3 bucket like below:
df_songs = spark.read.option("recursiveFileLookup", "true") \
    .json("s3a://mydata/song_data/", schema=song_schema)
It gives me the following error:
IllegalArgumentException: For input string: "64M"
Environment: Amazon EMR Service
CodePudding user response:
The hadoop-aws module added support for multiple cloud providers from version 3.0 onwards. Hadoop's S3A client (introduced in the 2.x line and substantially reworked in 3.x) offers high-performance IO against the Amazon S3 object store and compatible implementations.
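A likely root cause of the "64M" error (an assumption based on the error text, not confirmed from EMR logs): older hadoop-aws releases read size settings such as fs.s3a.multipart.size with a plain long parse, while newer Hadoop accepts human-readable suffixes like "64M", which is the form EMR ships in its defaults. A minimal Python sketch of the two parsing behaviors (helper names are illustrative, not Hadoop APIs):

```python
def parse_plain_long(value: str) -> int:
    """Old-style parsing: digits only, roughly what Hadoop 2.7's S3A required."""
    return int(value)  # raises ValueError on "64M"

def parse_size_with_suffix(value: str) -> int:
    """Suffix-aware parsing, roughly what newer Hadoop versions allow."""
    multipliers = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    value = value.strip().upper()
    if value and value[-1] in multipliers:
        return int(value[:-1]) * multipliers[value[-1]]
    return int(value)

try:
    parse_plain_long("64M")
except ValueError:
    print("plain long parse fails on '64M'")

print(parse_size_with_suffix("64M"))  # 67108864 bytes
```

So the old jar chokes as soon as it reads a suffixed default, which matches the IllegalArgumentException: For input string: "64M" message.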
To fix this issue I had to create the Spark session with a recent hadoop-aws version:
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()
I picked the latest 3.x version of hadoop-aws from the Maven repository.
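If upgrading the jar is not an option, a possible workaround (an assumption on my part, not something I have tested on EMR) is to override the suffixed size defaults with plain byte counts so the older parser can still read them, e.g. in spark-defaults.conf or via --conf:

```
# Hedged workaround: express sizes in bytes instead of "64M"-style suffixes.
# Property names are the standard s3a settings; the byte values are examples.
spark.hadoop.fs.s3a.multipart.size    104857600
spark.hadoop.fs.s3a.block.size        67108864
```

Matching the hadoop-aws version to the Hadoop version your EMR release actually runs is still the cleaner fix.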
Reference: https://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/index.html