Spark deserialization sometimes takes too long

Time:09-24


In our Spark cluster the servers have different configurations, and on the weaker servers task deserialization sometimes takes up to 2 minutes. The serializer is org.apache.spark.serializer.KryoSerializer. What could be causing this?

Server configuration: 32 cores / 126 GB
Run mode: on YARN
Resource usage: less than one fifth
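For reference, Kryo is typically enabled and tuned through configuration like the following (a minimal sketch; the registrator class name is a placeholder for your own):

```properties
# spark-defaults.conf (or pass each line via --conf on spark-submit)
spark.serializer                  org.apache.spark.serializer.KryoSerializer
# Fail fast if a class is not registered, instead of silently falling
# back to writing full class names (slower, larger payloads)
spark.kryo.registrationRequired   true
# Placeholder: your own KryoRegistrator implementation
spark.kryo.registrator            com.example.MyKryoRegistrator
# Buffer sizes worth checking if serialization is slow on large records
spark.kryoserializer.buffer       64k
spark.kryoserializer.buffer.max   64m
```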


CodePudding user response:

I just saw an article that may help you here.

CodePudding user response:

First of all, thanks for the reply above.
I was already using Kryo when I hit this problem, and I have registered some classes with it. Still, out of roughly 200 stage executions there are always one or two stages where deserialization takes at least 2 minutes (normal processing time is about 4 seconds), and the amount of data being processed is not large.

CodePudding user response:

Quoting the reply on floor 3 by u012591139:
"First of all, thanks for the reply above. I was already using Kryo when I hit this problem, and I have registered some classes with it. Still, out of roughly 200 stage executions there are always one or two stages where deserialization takes at least 2 minutes (normal processing time is about 4 seconds), and the amount of data being processed is not large."

This looks like data skew: a disproportionately large share of the partition data is being assigned to the weaker nodes. There are many solutions for data skew, depending on the data itself; the essential goal is to assign a similar number of records to each partition.
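One quick way to check for skew is to compare per-partition record counts (Spark's UI shows these per task). As a language-neutral sketch, here is a small helper (hypothetical, not part of any Spark API) that flags skew from a list of partition sizes:

```python
def skew_ratio(partition_counts):
    """Return max/mean record count across partitions.

    A ratio near 1.0 means records are evenly spread; a large ratio
    (say, above 5) suggests one partition is doing most of the work.
    """
    if not partition_counts:
        raise ValueError("no partitions")
    mean = sum(partition_counts) / len(partition_counts)
    if mean == 0:
        return 1.0
    return max(partition_counts) / mean


# Example: 8 partitions, one of them heavily overloaded
counts = [100, 120, 95, 110, 105, 2000, 98, 102]
print(round(skew_ratio(counts), 2))  # -> 5.86, a clear skew signal
```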

CodePudding user response:

Agree with the above; it should be data skew. You can increase the number of partitions, or use a custom partitioning method to solve it.
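A common custom-partitioning trick for skewed keys is salting: append a small random suffix to each hot key so its records spread across several partitions, then strip the salt after the shuffle. A minimal sketch in plain Python (the key names, hot-key set, and partition count are made up for illustration):

```python
import random

NUM_PARTITIONS = 8
HOT_KEYS = {"user_0"}   # hypothetical key known to dominate the data
SALT_BUCKETS = 4        # spread each hot key over 4 sub-keys

def salted_key(key):
    """Turn a hot key into one of SALT_BUCKETS sub-keys; leave others alone."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

def partition_for(key):
    """Hash-partition on the (possibly salted) key."""
    return hash(salted_key(key)) % NUM_PARTITIONS

# Without salting, every "user_0" record lands in one partition;
# with salting it can land in up to SALT_BUCKETS different ones.
targets = {partition_for("user_0") for _ in range(1000)}
print(len(targets))  # up to SALT_BUCKETS distinct partitions
```

Note that if the salted key feeds a join, the other side must be replicated across the same salt buckets so matching rows still meet.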

CodePudding user response:

It may not necessarily be data skew. The OP already mentioned that the data volume for this task is not large compared with other tasks.

But the OP has not made clear whether there is always ONE task per executor that is much slower than the rest because its task deserialization takes much longer.

If this IS the case, it is most likely the time spent shipping the jars from the driver to the executors. You should only pay this cost once per SparkContext (assuming you are not adding more jars later on).

When you submit your Spark jobs, how large is your jar file? A few hundred KB is a very different story from a few hundred MB.
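If the jar really is large, one mitigation on YARN is to stage the Spark jars on HDFS once, so executors localize them from HDFS rather than receiving them from the driver on every submission. A sketch with placeholder paths:

```properties
# spark-defaults.conf -- paths are placeholders for your cluster
# Archive of Spark's jars, uploaded once to HDFS
spark.yarn.archive    hdfs:///apps/spark/spark-libs.zip
# Or point at a jar directory explicitly instead
# spark.yarn.jars     hdfs:///apps/spark/jars/*
```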