I am new to EMR and Big Data. We have an EMR step that was working fine until last month; now I am getting the error below.
--- Logging error ---
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/src.zip/src/source/Data_Extraction.py", line 59, in process_job_description
df_job_desc = spark.read.schema(schema_jd).option('multiline',"true").json(self.filepath)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at
These JSON files are present in S3. I downloaded some of them to reproduce the issue locally; with a smaller set of data it works fine, but I am unable to reproduce the EMR failure.
I also checked the Application details of this step in EMR; it reports an "undefined" status with the following details:
Details:org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3285)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:750)
Spark session creation:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark_builder = (
    SparkSession
    .builder
    .config(conf=SparkConf())
    .appName("test")
)
spark = spark_builder.getOrCreate()
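If the handshake failures turn out to be transient S3 throttling, one thing to try is raising the EMRFS retry count for the job. A hedged sketch of the relevant configuration (the property name is EMRFS-specific and the value is only an example; verify both against the docs for your EMR release):

```
# spark-defaults.conf, or pass via --conf on spark-submit
# (fs.s3.maxRetries is assumed here to be the EMRFS retry knob)
spark.hadoop.fs.s3.maxRetries    20
```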
I am not sure what suddenly went wrong with this step; please help.
CodePudding user response:
Your error indicates a failed TLS handshake; search results for this error generally point to the remote endpoint throttling or rejecting incoming TLS connections.
You can try retrying with an exponential backoff strategy and limiting the rate of your requests.
Additionally, check your DNS quotas to make sure nothing there is being limited or exhausted.
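The exponential-backoff suggestion above can be sketched in plain Python. The helper names and parameters here are hypothetical, not part of the original job; the Spark read call would be wrapped in a zero-argument function and passed in:

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Yield one sleep duration per attempt: exponential growth with full jitter."""
    for attempt in range(max_retries):
        # Delay window doubles each attempt, capped, with random jitter applied.
        yield random.uniform(0, min(cap, base * (2 ** attempt)))

def with_backoff(fn, max_retries=5):
    """Call fn(); on failure, sleep per backoff_delays and retry, re-raising at the end."""
    for i, delay in enumerate(backoff_delays(max_retries)):
        try:
            return fn()
        except Exception:
            if i == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
```

Usage would look like `df = with_backoff(lambda: spark.read.schema(schema_jd).option('multiline', 'true').json(path))`. Full jitter spreads retries out so many executors do not hammer S3 in lockstep.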
Further, add your application environment details so it can be checked whether an outdated version might be causing this:
- EMR Release version
- Spark Versions
- AWS SDK Version
- AMI [ Amazon Linux Machine Images ] versions
- Java & JVM Details
- Hadoop Details
The recommended environment would be AMI 2.x, EMR 5.3x, and the compatible SDKs for the same (preferably AWSS3JavaClient 1.11.x).
More information about EMR releases can be found in the Amazon EMR release guide.
Additionally, please provide a clear snippet showing exactly how you read your JSON files from S3: iteratively, one after the other, or in bulk or batches?
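For illustration, if the job currently issues one read per file, grouping the S3 paths into batches and passing each batch to a single read call can cut the number of concurrent S3 requests. A minimal sketch (the helper and the batch size are hypothetical, not from the original code):

```python
def batched(paths, size):
    """Split a list of S3 paths into chunks of at most `size` paths."""
    for i in range(0, len(paths), size):
        yield paths[i:i + size]

# Usage sketch: spark.read.json accepts a list of paths, so each batch
# could be read in one call, e.g.
#   for batch in batched(all_paths, 100):
#       df = spark.read.schema(schema_jd).option('multiline', 'true').json(batch)
```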
References used -
- https://github.com/aws/aws-sdk-java/issues/2269
- javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake during web service communicaiton
- https://github.com/aws/aws-sdk-java/issues/1405
CodePudding user response:
From your error message (...SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake), it seems either the host does not accept your security protocol, or the connection was closed on the service side before the SDK could complete the handshake. You should add a try/except block with a delay between retries to handle these transient failures:
import time

max_retries = 5
for attempt in range(max_retries):
    try:
        df_job_desc = (spark.read.schema(schema_jd)
                       .option('multiline', 'true')
                       .json(self.filepath))
        break
    except Exception:
        if attempt == max_retries - 1:
            raise  # give up after the last attempt
        time.sleep(2 ** attempt)  # back off before retrying