Assume I have a dataframe with only one column, text. I want to write a UDF that extracts multiple text spans from each row of this dataframe. I have the following function:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(returnType=T.ArrayType(T.StringType()))
def generate_text_spans(text):
    spans = []
    # performs some processing on text and fills the spans list
    return spans

df = df.withColumn('spans', generate_text_spans(F.col('text')))
df.collect()
The problem is that when I run this code, the whole process happens on only one core, even though I have dedicated nearly 100 cores to my PySpark session (everything else runs smoothly on all cores, but this task uses only one). Do the operations inside the UDF affect this behaviour?
I would really appreciate any ideas on this.
CodePudding user response:
Use repartition. Spark runs one task per partition, so if your dataframe sits in a single partition the UDF executes on a single core. Repartitioning first spreads the rows across the cluster:
df = df.repartition(100).withColumn('spans', generate_text_spans(F.col('text')))
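If you want to verify that the repartition took effect, getNumPartitions (a standard RDD method) reports how many partitions, and therefore how many potential parallel tasks, the dataframe has:
print(df.rdd.getNumPartitions())  # e.g. 1 before repartitioning
df = df.repartition(100)
print(df.rdd.getNumPartitions())  # 100 afterwards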
CodePudding user response:
Apparently, this happens when you persist() your data and then show() it. Removing those lines did the trick for me; now all cores are being used.
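A likely explanation: show() only needs to compute enough partitions to produce its first 20 rows, so Spark often schedules just a single task for it, whereas a full action evaluates every partition. A minimal sketch of the difference, assuming the same df and UDF as in the question:
# show() needs only ~20 rows, so Spark typically runs a single partition/task
df.withColumn('spans', generate_text_spans(F.col('text'))).show()

# collect() materialises every row, so all partitions (and cores) participate
rows = df.withColumn('spans', generate_text_spans(F.col('text'))).collect()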