py4j.protocol.Py4JJavaError: An error occurred while calling o36.partitions.: java.lang.NumberFormatException


I am trying to read a compressed log file from an S3 bucket using PySpark on an EC2 instance. The EC2 instance has read permission on the S3 bucket, as I am able to manually download the file using an AWS CLI command.

This is what my code looks like:

file_path= 's3a://<bucket_name>/<path_of_file>'

rdd1 = sc.textFile(file_path)

rdd1.take(3)


But I am getting the error below:

py4j.protocol.Py4JJavaError: An error occurred while calling o36.partitions.
: java.lang.NumberFormatException: For input string: "64M"

Can somebody help me out?

CodePudding user response:

You are mixing a newer version of hadoop-common with an older version of hadoop-aws.

The s3a connector added support for using a unit suffix when declaring the multipart block size back in 2016, in https://issues.apache.org/jira/browse/HADOOP-13680.

hadoop-common JAR versions 2.8+ set the default to "64M".

If the version of the s3a connector you are using can't cope with that, it predates that change and is many years out of date.
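
A quick way to confirm the mismatch from the PySpark shell (a sketch, assuming sc is the SparkContext from the question) is to print the default that your hadoop-common JAR ships:

# Sketch: show the effective fs.s3a.multipart.size; hadoop-common 2.8+ reports "64M",
# which an older hadoop-aws JAR cannot parse.
print(sc._jsc.hadoopConfiguration().get('fs.s3a.multipart.size'))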

Please:

  1. Upgrade your hadoop-* JARs to a recent version, ideally 3.3.0.
  2. Make sure they are all the same version, unless you enjoy seeing stack traces.
  3. Use the exact same aws-sdk-bundle JAR that Hadoop was built with, unless you want to see different stack traces.

This is not an opinion; these are instructions from the hadoop-aws maintenance team.
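
As a concrete illustration, here is a minimal sketch of that advice: let spark.jars.packages resolve a single, recent hadoop-aws release, so the matching aws-java-sdk-bundle comes in as its transitive dependency. The 3.3.4 version below is an assumption; pick the hadoop-aws release that matches the Hadoop version your Spark distribution was built with.

from pyspark.sql import SparkSession

# Sketch: one recent hadoop-aws version, resolved together with the
# aws-java-sdk-bundle it was built against (pulled in transitively).
spark = (
    SparkSession.builder
    .appName('s3a_read')
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
    .getOrCreate()
)

rdd1 = spark.sparkContext.textFile('s3a://<bucket_name>/<path_of_file>')
print(rdd1.take(3))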

CodePudding user response:

It seems that you are either missing the JAR file that is supposed to handle the data type conversion, or you haven't set the relevant configuration.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf() \
    .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true') \
    .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true') \
    .setAppName('pyspark_aws') \
    .setMaster('local[*]')  # set the app name and run against the local machine
conf.set('spark.jars.packages',
         'org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-client:3.2.0,org.apache.hadoop:hadoop-common:3.2.0')  # download the JAR files
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.GzipCodec")  # these lines compress output files with gzip
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
conf.set("spark.speculation", "false")

The spark.jars.packages setting downloads the JAR files needed for writing to S3: Hadoop AWS, Hadoop Client, and Hadoop Common.

The configuration below is what most probably caused your error. It controls the S3A multipart upload part size; giving it a plain byte count avoids the "64M" parsing problem.

conf.set("fs.s3a.multipart.size", "104857600")

Setting the file system and keys for authorization:

sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', 'my access key id')
hadoopConf.set('fs.s3a.secret.key', 'my secret access key')
hadoopConf.set('fs.s3a.endpoint', 's3.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

spark = SparkSession(sc)
df = spark.read.text('s3a://<bucket_name>/<path_of_file>')  # df is whatever DataFrame you want to write out
df.write.parquet('s3a://myfolder/myfolderwhereGzipWillUpload', mode='overwrite')

Your files will be uploaded as GZIPs to your S3 bucket. You can change the settings to your liking.
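
For the original read-only case, the relevant piece is just the multipart size override. Here is a minimal sketch, assuming sc already exists and the mismatched JARs stay in place; other fs.s3a.* size settings with unit suffixes may need the same treatment, and upgrading the JARs remains the proper fix:

# Sketch: replace the "64M" default with a plain byte count so an older
# hadoop-aws JAR can parse it, then retry the read from the question.
sc._jsc.hadoopConfiguration().set('fs.s3a.multipart.size', '104857600')

rdd1 = sc.textFile('s3a://<bucket_name>/<path_of_file>')
print(rdd1.take(3))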
