Pyspark - Loop n times - Each loop gets gradually slower


So basically I want to loop n times over my DataFrame and apply a function in each loop (perform a join). My test DataFrame has about 1,000 rows, and in each iteration exactly one column is added. The first three loops finish instantly, but from then on it gets really, really slow. The 10th loop, for example, needs more than 10 minutes.

I don't understand why this happens, because my DataFrame doesn't grow in terms of rows. If I call my function with n=20, for example, the join performs instantly. But when I loop iteratively 20 times, it soon gets stuck.
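For illustration, a minimal sketch of the kind of loop described (the column names, join key, and data are hypothetical, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical ~1000-row base DataFrame with a join key
df = spark.range(1000).withColumnRenamed("id", "key")

for i in range(20):
    # hypothetical lookup DataFrame that contributes exactly one new column
    extra = spark.range(1000).select(
        F.col("id").alias("key"),
        F.rand().alias("col_{}".format(i)),
    )
    # each join extends the logical plan; Spark re-analyzes the whole
    # accumulated lineage on every iteration, so later iterations get slower
    df = df.join(extra, on="key", how="left")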

Do you have any idea what could be causing this problem?

CodePudding user response:

Example code from "Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller":

import time
from pyspark import SparkContext

sc = SparkContext()

def push_and_pop(rdd):
    # two transformations: moves the head element to the tail
    first = rdd.first()
    return rdd.filter(
        lambda obj: obj != first
    ).union(
        sc.parallelize([first])
    )

def serialize_and_deserialize(rdd):
    # perform a collect() action to evaluate the rdd and create a new instance
    return sc.parallelize(rdd.collect())

def do_test(serialize=False):
    rdd = sc.parallelize(range(1000))
    for i in range(25):
        t0 = time.time()
        rdd = push_and_pop(rdd)
        if serialize:
            rdd = serialize_and_deserialize(rdd)
        print "%.3f" % (time.time() - t0)

do_test()
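Running do_test() prints per-iteration times that keep growing, because every push_and_pop call appends filter/union steps to the RDD lineage and the driver has to re-evaluate the whole chain each time. Calling do_test(serialize=True) should keep the times roughly flat, since collect() followed by parallelize() cuts the lineage back to a single step.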

CodePudding user response:

I have fixed this issue by converting the DataFrame to an RDD and back to a DataFrame every n iterations. The code runs fast now, but I don't fully understand the exact reason. The explain plan seems to grow very quickly during the iterations if I don't do the conversion. This workaround is also described in the book "High Performance Spark":

While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration.
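A minimal sketch of that workaround, applied to the hypothetical join loop from the question above (the reset interval of 5 iterations is an arbitrary choice, not from the original answer):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "key")

for i in range(20):
    extra = spark.range(1000).select(
        F.col("id").alias("key"),
        F.rand().alias("col_{}".format(i)),
    )
    df = df.join(extra, on="key", how="left")

    # every few iterations, round-trip through an RDD so Catalyst starts
    # from a fresh, short query plan instead of the full accumulated lineage
    if (i + 1) % 5 == 0:
        df = spark.createDataFrame(df.rdd, schema=df.schema)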
