Below is a code snippet that uses multiple when clauses (here it's just a couple, but it could well be tens or more):
from pyspark.sql.functions import when

df = (df.withColumn('tmp_col',
                    when(df.some_col.isin(val_list1), "val_1")
                    .when(df.some_col.isin(val_list2), "val_2")
                    .otherwise('')))
When we have multiple such when conditions that could easily be written in a loop to reduce the number of code lines, should I do so, or will that significantly affect performance? Or is there perhaps a better/more efficient way to do this?
CodePudding user response:
As stated here, you can build the big when expression in a loop. Performance is the same as writing out many when clauses by hand, because the loop produces the exact same Column expression, so Spark optimizes and executes an identical plan either way.
from pyspark.sql import functions as F

# Each entry pairs a list of values with the label to assign.
conds_vals = [
    [val_list1, 'val_1'],
    [val_list2, 'val_2'],
    [val_list3, 'val_3'],
]

# F.when(...) returns a Column, and each subsequent .when chains onto
# that Column, so starting from the functions module lets the loop
# build the whole chain one clause at a time.
whens = F
for cv in conds_vals:
    whens = whens.when(df.some_col.isin(cv[0]), cv[1])
whens = whens.otherwise('')

df = df.withColumn('tmp_col', whens)
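
If you prefer to avoid the mutable accumulator variable, the same chain can be folded up with functools.reduce. This is just a minimal sketch assuming the same conds_vals structure as above; it builds the identical Column expression, so there is no performance difference between the two versions.

from functools import reduce
from pyspark.sql import functions as F

# Seed the fold with the first (values, label) pair, then chain the rest.
whens = reduce(
    lambda acc, cv: acc.when(df.some_col.isin(cv[0]), cv[1]),
    conds_vals[1:],
    F.when(df.some_col.isin(conds_vals[0][0]), conds_vals[0][1]),
).otherwise('')

df = df.withColumn('tmp_col', whens)

You can check with df.explain() that both versions produce the same physical plan as the hand-written chain of when clauses.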