I am new to EMR and Big Data. We have an EMR step that was working fine until last month; now I am getting the error below.
--- Logging error ---
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/src.zip/src/source/Data_Extraction.py", line 59, in process_job_description
df_job_desc = spark.read.schema(schema_jd).option('multiline',"true").json(self.filepath)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at
These JSON files are present in S3. I downloaded some of them to reproduce the issue locally; with a smaller set of data it works fine, but I am unable to reproduce the EMR failure.
I also checked the Application details of this step in EMR; it reports an "undefined" status with the following details:
Details:org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3285)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:750)
Spark session creation:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark_builder = (
    SparkSession
    .builder
    .config(conf=SparkConf())
    .appName("test")
)
spark = spark_builder.getOrCreate()
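If the handshake failures turn out to be transient S3 throttling, one thing to try is raising the EMRFS retry count for the job. A hedged sketch of the relevant configuration (the property name is EMRFS-specific and the value is only an example; verify both against the docs for your EMR release):

```
# spark-defaults.conf, or pass via --conf on spark-submit
# (fs.s3.maxRetries is assumed here to be the EMRFS retry knob)
spark.hadoop.fs.s3.maxRetries    20
```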
I am not sure what suddenly went wrong with this step; please help.
CodePudding user response:
Your error indicates a failed TLS handshake; search results for this error generally point to the remote endpoint throttling or rejecting incoming TLS connections.
You can try retrying with an exponential backoff strategy and limiting the rate of your requests.
Additionally, check your DNS quotas to make sure nothing there is being limited or exhausted.
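The exponential-backoff suggestion above can be sketched in plain Python. The helper names and parameters here are hypothetical, not part of the original job; the Spark read call would be wrapped in a zero-argument function and passed in:

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Yield one sleep duration per attempt: exponential growth with full jitter."""
    for attempt in range(max_retries):
        # Delay window doubles each attempt, capped, with random jitter applied.
        yield random.uniform(0, min(cap, base * (2 ** attempt)))

def with_backoff(fn, max_retries=5):
    """Call fn(); on failure, sleep per backoff_delays and retry, re-raising at the end."""
    for i, delay in enumerate(backoff_delays(max_retries)):
        try:
            return fn()
        except Exception:
            if i == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
```

Usage would look like `df = with_backoff(lambda: spark.read.schema(schema_jd).option('multiline', 'true').json(path))`. Full jitter spreads retries out so many executors do not hammer S3 in lockstep.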
Further, add your application environment details so it can be checked whether an outdated version might be causing this:
- EMR Release version
- Spark Versions
- AWS SDK Version
- AMI [ Amazon Linux Machine Images ] versions
- Java & JVM Details
- Hadoop Details
The recommended environment would be AMI 2.x, EMR 5.3x, and the compatible SDKs for the same (preferably AWSS3JavaClient 1.11.x).
More information about EMR releases can be found in the Amazon EMR release guide.
Additionally, please provide a clear snippet showing exactly how you read your JSON files from S3: iteratively, one after the other, or in bulk or batches?
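For illustration, if the job currently issues one read per file, grouping the S3 paths into batches and passing each batch to a single read call can cut the number of concurrent S3 requests. A minimal sketch (the helper and the batch size are hypothetical, not from the original code):

```python
def batched(paths, size):
    """Split a list of S3 paths into chunks of at most `size` paths."""
    for i in range(0, len(paths), size):
        yield paths[i:i + size]

# Usage sketch: spark.read.json accepts a list of paths, so each batch
# could be read in one call, e.g.
#   for batch in batched(all_paths, 100):
#       df = spark.read.schema(schema_jd).option('multiline', 'true').json(batch)
```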
References used -
- https://github.com/aws/aws-sdk-java/issues/2269
- javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake during web service communicaiton
- https://github.com/aws/aws-sdk-java/issues/1405
CodePudding user response:
From your error message (...SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake), it seems either the host does not accept your security protocol, or the connection was closed on the service side before the SDK could complete the handshake. You should add a try/except block with a delay between retries to handle these transient failures:
import time

max_retries = 5
for attempt in range(max_retries):
    try:
        df_job_desc = (spark.read.schema(schema_jd)
                       .option('multiline', 'true')
                       .json(self.filepath))
        break
    except Exception:
        if attempt == max_retries - 1:
            raise  # give up after the last attempt
        time.sleep(2 ** attempt)  # back off before retrying