I have a PySpark DF with an array column with data such as: [0,-1,0,0,1,1,1]
I'd like to use higher order PySpark functions to convert all negative values to 0, for a result like: [0,0,0,0,1,1,1]
I've tried:
sdf = (sdf
.withColumn('ArrayCol',f.transform(f.col('ArrayCol'),lambda x: 0 if x < 0 else x))
)
but it returns the error: "Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions."
as well as:
sdf = (sdf
.withColumn('ArrayCol',f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x "))
)
but it returns the error: "extraneous input 'WHEN' expecting {')', ','} (line 1, pos 37)"
Any help would be greatly appreciated!
EDIT: Looks like:
.withColumn('ArrayCol', f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x END)"))
works.
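For reference, a minimal self-contained sketch of the expr approach (the SparkSession setup and sample data are assumptions for illustration):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: one row with a single array column
sdf = spark.createDataFrame([([0, -1, 0, 0, 1, 1, 1],)], ['ArrayCol'])

# SQL higher-order function: clamp negative elements to 0
sdf = sdf.withColumn(
    'ArrayCol',
    f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x END)")
)

sdf.show(truncate=False)  # ArrayCol: [0, 0, 0, 0, 1, 1, 1]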
I'd still be interested to understand how you could do this with a lambda function. Thanks!
CodePudding user response:
As stated in the documentation for pyspark.sql.functions.transform, the lambda can use methods of Column, functions defined in pyspark.sql.functions and Scala UserDefinedFunctions.
In other words, the lambda must be built from PySpark's functions API (in your case when and otherwise), whereas 0 if x < 0 else x is a native Python expression: the if evaluates the Column x < 0 as a Python boolean, which is what raises the "Cannot convert column into bool" error.
The lambda function could then look like this:
sdf.withColumn('ArrayCol', f.transform(f.col('ArrayCol'), lambda x: f.when(x < 0, 0).otherwise(x)))
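Run against a small hypothetical DataFrame (same shape as the example in the question), this gives the expected result; note that f.transform is available in pyspark.sql.functions from Spark 3.1:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([([0, -1, 0, 0, 1, 1, 1],)], ['ArrayCol'])

# each element x is passed to the lambda as a Column, so the condition
# must be a Column expression (when/otherwise), not a Python if/else
sdf = sdf.withColumn(
    'ArrayCol',
    f.transform(f.col('ArrayCol'), lambda x: f.when(x < 0, 0).otherwise(x))
)
sdf.show(truncate=False)  # ArrayCol: [0, 0, 0, 0, 1, 1, 1]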