Spark DataFrame Get Null Count For All Columns


I have a DataFrame for which I would like to get the total null-value count, and I have the following code that is meant to do this generically across all columns:

First, my DataFrame, reduced to a single column for simplicity:

val recVacDate = dfRaw.select("STATE")

When I count the nulls with a simple filter, I see the expected number:

val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051

But when I use the code below, I get just 1 as the result, and I do not understand why:

val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*) 
println(nullCount.count()) // Prints 1

Any ideas as to what is wrong with nullCount? The data type of the column is String.

CodePudding user response:

This kind of fixed it:

df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)

Notice the use of the when clause inside count. Without it, count(col(c).isNull || ...) counts every non-null boolean value, i.e. every row, rather than only the rows where the condition is true. Also note that the select produces an aggregated DataFrame with a single row of per-column counts, so nullCount.count() will always print 1; use show() or collect that row to see the actual null counts.
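For completeness, here is a minimal, self-contained sketch of the approach, assuming a local SparkSession and a small hand-built STATE column (the isNaN check is omitted here because the column is a String):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

object NullCountExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NullCountExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for dfRaw.select("STATE")
    val df = Seq(Some("CA"), None, Some(""), Some("NY"), None).toDF("STATE")

    // One count(...) aggregate per column; when(...) returns null for rows
    // that do not match the condition, and count() ignores nulls, so only
    // the matching rows are counted.
    val nullCounts = df.select(
      df.columns.map(c =>
        count(when(col(c).isNull || col(c) === "", c)).alias(c)
      ): _*
    )

    // The result is a single-row DataFrame, so use show() (or collect the
    // row) rather than count() to inspect the per-column totals.
    nullCounts.show() // prints a single row with STATE = 3 for this sample

    spark.stop()
  }
}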
