I have a dataframe with N columns and I want to create a new column containing the number of columns that have a NULL value. I tried to create a UDF, but it's not working because I can't pass it an array of parameters.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import spark.implicits._

val simpleData = Seq(
  ("row1", "NULL", "NULL", "NULL", "NULL", "NULL", "1"),
  ("row2", "1", "NULL", "2023", "NULL", "01", "NULL"))
val myDs = simpleData.toDF("row", "field1", "field2", "field3", "field4", "field5", "field6")
myDs.show()

val windowcols = myDs.columns.filterNot(List("row").contains(_))

def countNullsUDF: UserDefinedFunction = udf { (values: List[String]) =>
  values.filter(value => value == "NULL").length
}

myDs.withColumn("columnsWithNull", countNullsUDF(windowcols)).show(10, false)
Is it possible to pass it an Array of columns or something similar? I couldn't get it to work.
CodePudding user response:
Actually, what you did is almost correct. You can't pass a list of columns to a UDF; however, you can group all the columns into one array column and then pass that array column to your UDF:
import org.apache.spark.sql.functions.{array, col}
// ...
myDs.withColumn(
  "columnsWithNull",
  countNullsUDF(array(windowcols.map(col): _*))
).show(10, false)
+----+------+------+------+------+------+------+---------------+
|row |field1|field2|field3|field4|field5|field6|columnsWithNull|
+----+------+------+------+------+------+------+---------------+
|row1|NULL  |NULL  |NULL  |NULL  |NULL  |1     |5              |
|row2|1     |NULL  |2023  |NULL  |01    |NULL  |3              |
+----+------+------+------+------+------+------+---------------+
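One note, hedged because the behaviour can differ between Spark versions: when an array column is passed to a Scala UDF, the values may arrive as a generic Seq (a WrappedArray) rather than a List, which can end in a ClassCastException. A minimal sketch of a safer signature, if you run into that:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Seq[String] accepts whatever Seq implementation Spark hands to the UDF
// for an array column, independent of the concrete collection type.
def countNullsUDF: UserDefinedFunction = udf { (values: Seq[String]) =>
  values.count(_ == "NULL")
}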
I only needed to transform your list of column names into a list of columns with .map(col), and to use : _* to expand that list, since array takes multiple parameters.
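For completeness, a minimal UDF-free sketch (assuming, as in your sample data, that the missing values are the literal string "NULL" rather than real nulls): build a 0/1 expression per column and sum them.

import org.apache.spark.sql.functions.{col, when}

// 1 when the column holds the string "NULL", otherwise 0;
// adding the per-column expressions yields the count per row.
val nullCount = windowcols
  .map(c => when(col(c) === "NULL", 1).otherwise(0))
  .reduce(_ + _)

myDs.withColumn("columnsWithNull", nullCount).show(10, false)

Built-in expressions like this usually optimize better than a UDF, since Spark does not have to serialize each row into a Scala function.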