Best practice when using multiple when in Spark/PySpark


Below is a code snippet that uses multiple when clauses (here it's just a couple, but it could easily be tens or more):

from pyspark.sql.functions import when

df = (df.withColumn('tmp_col',
                    when(df.some_col.isin(val_list1), "val_1")
                    .when(df.some_col.isin(val_list2), "val_2")
                    .otherwise('')))

When there are many such when conditions, they could easily be generated in a loop to reduce the number of lines of code. Should I do that, or would it significantly affect performance? Or is there a better/more efficient way altogether?

CodePudding user response:

You can build the big when expression in a loop. The loop produces exactly the same chained expression you would write by hand, so Spark's Catalyst optimizer sees the same expression tree and the performance is identical to spelling out every when clause.

from pyspark.sql import functions as F

conds_vals = [
    (val_list1, 'val_1'),
    (val_list2, 'val_2'),
    (val_list3, 'val_3'),
]

# Start from the functions module itself: on the first iteration this calls
# F.when(...), which returns a Column, and every later .when chains on that
# Column, building up a single CASE WHEN expression.
whens = F
for val_list, label in conds_vals:
    whens = whens.when(df.some_col.isin(val_list), label)
whens = whens.otherwise('')

df = df.withColumn('tmp_col', whens)
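
If you prefer to avoid the whens = F trick, here is a minimal equivalent sketch using functools.reduce, assuming the same hypothetical val_list1..val_list3 variables and a non-empty conds_vals. It builds the identical Column:

from functools import reduce
from pyspark.sql import functions as F

conds_vals = [
    (val_list1, 'val_1'),
    (val_list2, 'val_2'),
    (val_list3, 'val_3'),
]

# Seed the fold with the first when clause, then chain the remaining
# (values, label) pairs onto the accumulated Column.
whens = reduce(
    lambda col, cv: col.when(df.some_col.isin(cv[0]), cv[1]),
    conds_vals[1:],
    F.when(df.some_col.isin(conds_vals[0][0]), conds_vals[0][1]),
).otherwise('')

df = df.withColumn('tmp_col', whens)

Either way, you can confirm the looped and hand-written versions are equivalent by calling df.explain() on both: the generated CASE WHEN expression in the plan is the same.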