I am creating a Spark session using the snippet below in a Python notebook on an AWS EMR cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()
Then I read data from an S3 bucket like below:
df_songs = spark.read.option("recursiveFileLookup", "true") \
    .json("s3a://mydata/song_data/", schema=song_schema)
It gives me the following error:
IllegalArgumentException: For input string: "64M"
Environment: Amazon EMR Service
CodePudding user response:
The hadoop-aws module added support for multiple cloud providers from version 3.0 onwards. Hadoop's S3A client (introduced in the 2.x line and substantially reworked in 3.x) offers high-performance IO against the Amazon S3 object store and compatible implementations.
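A likely root cause of the "64M" error (an assumption based on the error text, not confirmed from EMR logs): older hadoop-aws releases read size settings such as fs.s3a.multipart.size with a plain long parse, while newer Hadoop accepts human-readable suffixes like "64M", which is the form EMR ships in its defaults. A minimal Python sketch of the two parsing behaviors (helper names are illustrative, not Hadoop APIs):

```python
def parse_plain_long(value: str) -> int:
    """Old-style parsing: digits only, roughly what Hadoop 2.7's S3A required."""
    return int(value)  # raises ValueError on "64M"

def parse_size_with_suffix(value: str) -> int:
    """Suffix-aware parsing, roughly what newer Hadoop versions allow."""
    multipliers = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    value = value.strip().upper()
    if value and value[-1] in multipliers:
        return int(value[:-1]) * multipliers[value[-1]]
    return int(value)

try:
    parse_plain_long("64M")
except ValueError:
    print("plain long parse fails on '64M'")

print(parse_size_with_suffix("64M"))  # 67108864 bytes
```

So the old jar chokes as soon as it reads a suffixed default, which matches the IllegalArgumentException: For input string: "64M" message.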
To fix this issue I had to create the Spark session with a recent hadoop-aws version:
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()
I picked the latest 3.x version of hadoop-aws from the Maven repository.
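If upgrading the jar is not an option, a possible workaround (an assumption on my part, not something I have tested on EMR) is to override the suffixed size defaults with plain byte counts so the older parser can still read them, e.g. in spark-defaults.conf or via --conf:

```
# Hedged workaround: express sizes in bytes instead of "64M"-style suffixes.
# Property names are the standard s3a settings; the byte values are examples.
spark.hadoop.fs.s3a.multipart.size    104857600
spark.hadoop.fs.s3a.block.size        67108864
```

Matching the hadoop-aws version to the Hadoop version your EMR release actually runs is still the cleaner fix.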
Reference: https://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/index.html