PySpark - Resolving isnan errors with TimeStamp datatype


I'm trying to create a function to check the quality of data (NaNs, nulls, etc.). I have the following code running on a PySpark DataFrame:

from pyspark.sql import functions as f

df.select([f.count(f.when((f.isnan(c) | f.col(c).isNull()), c)).alias(c) for c in cols_check]).show()

As long as the columns to check are strings or integers, I have no issue. However, when I check columns with a datatype of date or timestamp, I receive the following error:

cannot resolve 'isnan(Date_Time)' due to data type mismatch: argument 1 requires (double or float) type, however, 'Date_Time' is of timestamp type.;;\n'Aggregate...

There are clearly null values in the column, so how can I remedy this?
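
For reference, a minimal sketch that reproduces the error (the data and session setup here are assumed for illustration; my real DataFrame just has a timestamp column named Date_Time among others):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one valid timestamp and one null
df = spark.createDataFrame(
    [("2021-12-25 10:00:00",), (None,)], ["Date_Time"]
).withColumn("Date_Time", f.col("Date_Time").cast("timestamp"))

# Fails with "cannot resolve 'isnan(Date_Time)' due to data type mismatch"
df.select(f.isnan("Date_Time")).show()

# Works: isNull is valid for timestamp columns
df.select(f.count(f.when(f.col("Date_Time").isNull(), "Date_Time"))).show()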

CodePudding user response:

You can use df.dtypes to check the type of each column and handle the null count differently for timestamp and date columns, like this:

from pyspark.sql import functions as F

df.select(*[
    (
        # isnan is only defined for float/double, so skip it for timestamp/date columns
        F.count(F.when((F.isnan(c) | F.col(c).isNull()), c)) if t not in ("timestamp", "date")
        # for timestamp/date columns, count only the nulls
        else F.count(F.when(F.col(c).isNull(), c))
    ).alias(c)
    for c, t in df.dtypes if c in cols_check
]).show()
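
If you prefer a whitelist over a blacklist, an equivalent sketch (same cols_check assumption) applies isnan only to float and double columns, the only types it accepts directly, and falls back to a plain isNull check for everything else:

from pyspark.sql import functions as F

# Same idea, expressed as a whitelist: use isnan only where the dtype supports it
df.select(*[
    (
        F.count(F.when((F.isnan(c) | F.col(c).isNull()), c)) if t in ("float", "double")
        else F.count(F.when(F.col(c).isNull(), c))
    ).alias(c)
    for c, t in df.dtypes if c in cols_check
]).show()

Note that this variant only counts nulls for string columns; if you rely on isnan's implicit cast to also flag "NaN"-like values in strings, keep the blacklist version above.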