I have a DataFrame and would like to count the null values in each column. The code below is meant to do this generically for all columns.
First, my DataFrame, which contains just one column for simplicity:
val recVacDate = dfRaw.select("STATE")
When I count the nulls with a simple filter, I get the following:
val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051
But when I use the code below, I get just 1 as the result, and I do not understand why:
val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*)
println(nullCount.count()) // Prints 1
Any ideas as to what is wrong with nullCount? The column's data type is String.
Answer:
This fixed the count expression:
df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)
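Note the when clause inside count. count only counts non-null values, and the boolean condition on its own evaluates to true or false rather than null, so without when every row is counted. when without otherwise returns null for rows that do not match, so only the null, empty, or NaN rows are counted. Also note that the result is a single-row DataFrame holding one count per column, so nullCount.count() will always print 1; use nullCount.show() or nullCount.first() to read the actual counts.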
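For reference, here is a minimal self-contained sketch of the same idea that can be run locally. The sample data, the object name NullCountSketch, and the helper missingCounts are made up for illustration and are not from the question. I have also left out the isNaN check, since the column is a String; that is my own simplification, so add it back for numeric columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

object NullCountSketch {

  // Hypothetical helper: one output column per input column, each holding
  // the number of rows where that column is null or an empty string.
  def missingCounts(df: DataFrame): DataFrame =
    df.select(
      df.columns.map { c =>
        // when(...) without otherwise yields null for non-matching rows,
        // and count(...) skips nulls, so only matching rows are counted.
        count(when(col(c).isNull || col(c) === "", c)).alias(c)
      }: _*
    )

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for dfRaw.select("STATE").
    val recVacDate = Seq[String]("CA", null, "", "NY", null).toDF("STATE")

    val nullCount = missingCounts(recVacDate)

    // The aggregate is a single row, so read the value out of that row.
    nullCount.show()
    println(nullCount.first().getAs[Long]("STATE")) // 3 for the sample data

    // Counting the rows of the result is always 1, whatever the data.
    println(nullCount.count()) // 1

    spark.stop()
  }
}
```

The important part is reading the value with show() or first() instead of calling count() on the aggregated DataFrame, which is what produced the 1 in the question.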