Assume I have a dataframe with only one column, text. I want to write a UDF that extracts multiple text spans from each row of this dataframe. I have the following function:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(returnType=T.ArrayType(T.StringType()))
def generate_text_spans(text):
    spans = []
    # performs some processing on text and fills the spans list
    return spans

df = df.withColumn('spans', generate_text_spans(F.col('text')))
df.collect()
The problem is that when I run this code, the whole process happens on only one core, even though I have dedicated nearly 100 cores to my PySpark session (everything else runs smoothly on all cores, but this task uses only one). Do the operations inside the UDF affect this behaviour?
I would really appreciate any ideas on this.
CodePudding user response:
Use repartition. Spark runs one task per partition, so if your dataframe sits in a single partition the UDF executes on a single core. Repartitioning first spreads the rows across the cluster:
df = df.repartition(100).withColumn('spans', generate_text_spans(F.col('text')))
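If you want to verify that the repartition took effect, getNumPartitions (a standard RDD method) reports how many partitions, and therefore how many potential parallel tasks, the dataframe has:
print(df.rdd.getNumPartitions())  # e.g. 1 before repartitioning
df = df.repartition(100)
print(df.rdd.getNumPartitions())  # 100 afterwards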
CodePudding user response:
Apparently, this happens when you persist() your data and then show() it. Removing those lines did the trick for me; now all cores are being used.
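A likely explanation: show() only needs to compute enough partitions to produce its first 20 rows, so Spark often schedules just a single task for it, whereas a full action evaluates every partition. A minimal sketch of the difference, assuming the same df and UDF as in the question:
# show() needs only ~20 rows, so Spark typically runs a single partition/task
df.withColumn('spans', generate_text_spans(F.col('text'))).show()

# collect() materialises every row, so all partitions (and cores) participate
rows = df.withColumn('spans', generate_text_spans(F.col('text'))).collect()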