Home > Net >  Spark UDF input with a List of columns
Spark UDF input with a List of columns

Time:08-24

I have a dataframe with N columns and I want to create a new column with the number of columns that have a NULL value. I tried to create an UDF but it's not working because of I can't set an array of parameters.

val simpleData = Seq(
("row1", "NULL" , "NULL" , "NULL" , "NULL" , "NULL", "1"),
("row2", "1", "NULL", "2023", "NULL", "01", "NULL"))


val myDs = simpleData.toDF("row", "field1", "field2",  "field3", "field4", "field5", "field6")
myDs.show()

val windowcols = myDs.columns.filterNot(List("row").contains(_))

def countNullsUDF: UserDefinedFunction = udf { (values: List[String]) =>
  values.filter( value => value == "NULL").length
}

myDs.withColumn("columnsWithNull", countNullsUDF(windowcols)).show(10, false)

is it possible to pass it an Array of columns or similar? I didn't get it.

CodePudding user response:

Actually what you did is almost correct. You can't pass a list of columns to a UDF, however, you can group all columns into one array column, and then pass that array column to your UDF:

import org.apache.spark.sql.functions.{array, col}

// ...

myDs.withColumn(
  "columnsWithNull",
  countNullsUDF(array(windowcols.map(col): _*))
).show(10, false)

 ---- ------ ------ ------ ------ ------ ------ --------------- 
|row |field1|field2|field3|field4|field5|field6|columnsWithNull|
 ---- ------ ------ ------ ------ ------ ------ --------------- 
|row1|NULL  |NULL  |NULL  |NULL  |NULL  |1     |5              |
|row2|1     |NULL  |2023  |NULL  |01    |NULL  |3              |
 ---- ------ ------ ------ ------ ------ ------ --------------- 

I only needed to transform your list of column names into a list of columns with the .map(col), and to use : _* to expand the list as array takes multiple parameters.

  • Related