I have this code:
dfSpark = dfSpark\
.withColumn('colA', when(col('colA') == True, 'S').otherwise('N')) \
.withColumn('colB', when(col('colB') == True, 'S').otherwise('N')) \
.withColumn('colC', when(col('colC') == True, 'S').otherwise('N')) \
.withColumn('colD', when(col('colD') == True, 'S').otherwise('N')) \
.withColumn('colE', when(col('colE') == True, 'S').otherwise('N')) \
.withColumn('colF', when(col('colF') == True, 'S').otherwise('N')) \
.withColumn('colG', when(col('colG') == True, 'S').otherwise('N')) \
.withColumn('colH', when(col('colH') == True, 'S').otherwise('N')) \
.withColumn('colI', when(col('colI') == True, 'S').otherwise('N')) \
.withColumn('colJ', when(col('colJ') == True, 'S').otherwise('N'))
Is there another way that would be more efficient and less redundant? Something like a lambda in Python?
Thanks!
CodePudding user response:
Use select with a list comprehension:
from pyspark.sql import functions as F
update_cols = ["colA", "colB", "colC", "colD", "colE", "colF", "colG", "colH", "colI", "colJ"]
dfSpark = dfSpark.select(
    *[F.col(c) for c in dfSpark.columns if c not in update_cols],
    *[F.when(F.col(c), "S").otherwise("N").alias(c) for c in update_cols]
)
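If you are on Spark 3.3+, withColumns accepts a dict of {column name: expression}, which avoids listing the untouched columns and keeps the original column order. A minimal sketch reusing the same update_cols list:
dfSpark = dfSpark.withColumns(
    # build the {name: expression} mapping once and apply it in a single call
    {c: F.when(F.col(c), "S").otherwise("N") for c in update_cols}
)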
CodePudding user response:
You can use a UDF (user-defined function):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def udf_colA(colA):
    if colA:
        return "S"
    else:
        return "N"
and call it as:
dfSpark = dfSpark \
    .withColumn('colA', udf_colA(col('colA')))
But remember: Python UDFs involve repeated serialization, deserialization, and data movement between the Python interpreter and the JVM, and can be very slow on large datasets. The native Spark expressions (when/otherwise, as in your original code) are more efficient.
There are vectorised (pandas) UDFs as well; see https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
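For completeness, here is a minimal sketch of such a vectorised (pandas) UDF for the same boolean-to-'S'/'N' conversion; the function name bool_to_s_n is just illustrative, and it requires PyArrow to be installed:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

@pandas_udf("string")
def bool_to_s_n(s: pd.Series) -> pd.Series:
    # operates on whole Arrow batches (pandas Series) instead of row by row
    return s.map(lambda v: "S" if v else "N")

dfSpark = dfSpark.withColumn('colA', bool_to_s_n(col('colA')))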
For a detailed comparison, see: Spark functions vs UDF performance?