I have a PySpark DF with an array column with data such as: [0,-1,0,0,1,1,1]
I'd like to use higher order PySpark functions to convert all negative values to 0, for a result like: [0,0,0,0,1,1,1]
I've tried:
sdf = (sdf
.withColumn('ArrayCol',f.transform(f.col('ArrayCol'),lambda x: 0 if x < 0 else x))
)
but it returns the error: "Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions."
as well as:
sdf = (sdf
.withColumn('ArrayCol',f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x "))
)
but it returns the error: "extraneous input 'WHEN' expecting {')', ','} (line 1, pos 37)"
Any help would be greatly appreciated!
EDIT: Looks like:
.withColumn('ArrayCol', f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x END)"))
works.
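For reference, a minimal self-contained sketch of the expr approach (the SparkSession setup and sample data are assumptions for illustration):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: one row with a single array column
sdf = spark.createDataFrame([([0, -1, 0, 0, 1, 1, 1],)], ['ArrayCol'])

# SQL higher-order function: clamp negative elements to 0
sdf = sdf.withColumn(
    'ArrayCol',
    f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x END)")
)

sdf.show(truncate=False)  # ArrayCol: [0, 0, 0, 0, 1, 1, 1]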
I'd still be interested to understand how you could do this with a lambda function. Thanks!
CodePudding user response:
As stated in the documentation for pyspark.sql.functions.transform, the lambda can use methods of Column, functions defined in pyspark.sql.functions and Scala UserDefinedFunctions.
In other words, the lambda must be built from PySpark's functions API (in your case when and otherwise), whereas 0 if x < 0 else x is a native Python expression: the if evaluates the Column x < 0 as a Python boolean, which is what raises the "Cannot convert column into bool" error.
The lambda function could then look like this:
sdf.withColumn('ArrayCol', f.transform(f.col('ArrayCol'), lambda x: f.when(x < 0, 0).otherwise(x)))
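Run against a small hypothetical DataFrame (same shape as the example in the question), this gives the expected result; note that f.transform is available in pyspark.sql.functions from Spark 3.1:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([([0, -1, 0, 0, 1, 1, 1],)], ['ArrayCol'])

# each element x is passed to the lambda as a Column, so the condition
# must be a Column expression (when/otherwise), not a Python if/else
sdf = sdf.withColumn(
    'ArrayCol',
    f.transform(f.col('ArrayCol'), lambda x: f.when(x < 0, 0).otherwise(x))
)
sdf.show(truncate=False)  # ArrayCol: [0, 0, 0, 0, 1, 1, 1]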