Conditional Transform on PySpark Array Column with Higher Order Functions


I have a PySpark DataFrame with an array column containing data such as: [0, -1, 0, 0, 1, 1, 1]

I'd like to use higher-order PySpark functions to convert all negative values to 0, for a result like: [0, 0, 0, 0, 1, 1, 1]

I've tried:

sdf = (sdf
       .withColumn('ArrayCol',f.transform(f.col('ArrayCol'),lambda x: 0 if x < 0 else x))
)

but it returns the error: "Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions."

as well as:

sdf = (sdf
       .withColumn('ArrayCol',f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x "))
)

but it returns the error: extraneous input 'WHEN' expecting {')', ','} (line 1, pos 37)

Any help would be greatly appreciated!

EDIT: Looks like

.withColumn('ArrayCol', f.expr("transform(ArrayCol, x -> CASE WHEN x < 0 THEN 0 ELSE x END)"))

works. (The first attempt was missing the CASE expression's closing END, and the SQL string also needs the closing parenthesis of transform.)

I'd still be interested to understand how you could do this with a lambda function. Thanks!

CodePudding user response:

As stated in the PySpark documentation for transform, the lambda function

can use methods of Column, functions defined in pyspark.sql.functions and Scala UserDefinedFunctions. Python UserDefinedFunctions are not supported.

The lambda function should be expressed using PySpark's functions API (in your case when and otherwise). 0 if x < 0 else x is a native Python conditional: evaluating it calls bool() on the Column produced by x < 0, which is exactly what raises the "Cannot convert column into bool" error.
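To see the failure mode in isolation, here is a minimal sketch (the column name x is hypothetical):

from pyspark.sql import functions as f

cond = f.col('x') < 0   # a Column expression, not a Python bool
# bool(cond)            # raises: Cannot convert column into bool: ...
# so `0 if x < 0 else x` inside the lambda fails before Spark runs anything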

The lambda function could look like this:

from pyspark.sql import functions as f

sdf = sdf.withColumn('ArrayCol', f.transform(f.col('ArrayCol'), lambda x: f.when(x < 0, 0).otherwise(x)))
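For completeness, a self-contained sketch; the sample data and column name are assumptions matching the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data from the question
sdf = spark.createDataFrame([([0, -1, 0, 0, 1, 1, 1],)], ['ArrayCol'])

# build the per-element condition with when/otherwise instead of a Python ternary
sdf = sdf.withColumn('ArrayCol', f.transform('ArrayCol', lambda x: f.when(x < 0, 0).otherwise(x)))

sdf.show(truncate=False)   # expected: [0, 0, 0, 0, 1, 1, 1]

Note that pyspark.sql.functions.transform was only added in Spark 3.1; on older versions the f.expr("transform(...)") form from the edit above is the way to go.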