I'm trying to create a function to check the quality of data (NaNs, nulls, etc.). I have the following code running on a PySpark DataFrame:
df.select([f.count(f.when((f.isnan(c) | f.col(c).isNull()), c)).alias(c) for c in cols_check]).show()
As long as the columns to check are strings/integers, I have no issue. However, when I check columns with the datatype of date or timestamp, I receive the following error:
cannot resolve 'isnan(Date_Time)' due to data type mismatch: argument 1 requires (double or float) type, however, 'Date_Time' is of timestamp type.;;\n'Aggregate...
There are clearly null values in the column; how can I remedy this?
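For reference, a minimal setup that reproduces the error might look like this (the DataFrame, column names, and values here are illustrative, not my actual data):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Illustrative data: one string, one double, and one timestamp column
df = spark.createDataFrame(
    [("a", 1.0, "2021-01-01 00:00:00"), (None, float("nan"), None)],
    ["Name", "Score", "Date_Time"],
).withColumn("Date_Time", f.to_timestamp("Date_Time"))

cols_check = df.columns

# Works for Name and Score, but fails on the timestamp column Date_Time
df.select([f.count(f.when((f.isnan(c) | f.col(c).isNull()), c)).alias(c) for c in cols_check]).show()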
CodePudding user response:
You can use df.dtypes to check the type of each column, and handle the null count differently for timestamp and date columns, like this:
from pyspark.sql import functions as F

df.select(*[
    (
        # isnan only applies to numeric (or numeric-castable) columns,
        # so fall back to a plain isNull check for timestamp/date columns
        F.count(F.when((F.isnan(c) | F.col(c).isNull()), c)) if t not in ("timestamp", "date")
        else F.count(F.when(F.col(c).isNull(), c))
    ).alias(c)
    for c, t in df.dtypes if c in cols_check
]).show()
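If it helps to see what drives the branching, df.dtypes is simply a list of (column name, type string) pairs; the output below assumes a DataFrame with a string, a double, and a timestamp column:

# e.g. [('Name', 'string'), ('Score', 'double'), ('Date_Time', 'timestamp')]
print(df.dtypes)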