How to use IsNull() condition with date Columns in Pyspark dataframes?


How can we use isNull() with date-type columns in PySpark? I need to update a column named REGISTER_DATE when it is null and other conditions are also satisfied.

data = [{"REGISTER_DATE": '', "FORM_DATE": '16-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '13-09-2022', "FORM_DATE": '06-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '20-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '18-12-2022', "GENDER": 'Female', "Truth": True}]


from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
create_df = spark.createDataFrame(data)

The FORM_DATE and REGISTER_DATE columns are of date type, and GENDER is a string.

The Query I am trying is:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

create_df = create_df.withColumn('REGISTER_DATE', F.when((col('REGISTER_DATE').isNull()) &
    (col('FORM_DATE') != '') & (col('GENDER') == 'FEMALE'), create_df.FORM_DATE)
    .otherwise(create_df['REGISTER_DATE']))

The query runs fine, but it is not updating the REGISTER_DATE column even though the conditions are met in the source file.

As far as I know, using isNull with a date-type column may be invalid. I want to replace the null values in the REGISTER_DATE column with the FORM_DATE value when the condition matches.

CodePudding user response:

Let's say that you have this data:

 ------------- ---------- ------ ----- 
|REGISTER_DATE|FORM_DATE |GENDER|TRUTH|
 ------------- ---------- ------ ----- 
|null         |2022-12-16|Female|True |
|2022-09-13   |2022-12-06|Female|True |
|null         |2022-12-20|Female|True |
|null         |2022-12-18|Female|True |
 ------------- ---------- ------ ----- 

with this schema:

root
 |-- REGISTER_DATE: date (nullable = true)
 |-- FORM_DATE: date (nullable = true)
 |-- GENDER: string (nullable = true) <-- doesn't matter
 |-- TRUTH: string (nullable = true)  <-- doesn't matter

If your columns (REGISTER_DATE and FORM_DATE) are of date type, they will never contain an empty string; a date value is either a valid date or null.

If we do this:

from pyspark.sql.functions import col, when

data = data.withColumn("REGISTER_DATE",
    when(col("REGISTER_DATE").isNull(), col("FORM_DATE"))
    .otherwise(col("REGISTER_DATE")))

Then, we will have this as output:

 ------------- ---------- ------ ----- 
|REGISTER_DATE|FORM_DATE |GENDER|TRUTH|
 ------------- ---------- ------ ----- 
|2022-12-16   |2022-12-16|Female|True |
|2022-09-13   |2022-12-06|Female|True |
|2022-12-20   |2022-12-20|Female|True |
|2022-12-18   |2022-12-18|Female|True |
 ------------- ---------- ------ ----- 

In your example, you are filling the data with empty strings while claiming the columns are of type date, but that cannot be the case. Your & condition turns the whole expression false, most likely because you are comparing a date-type column with an empty string, which never matches.

You just have to make sure that REGISTER_DATE and FORM_DATE are of date type, not string, and that any casting you do in the future is correct.

Good luck!

CodePudding user response:

With the data you provided, Spark creates columns of string data type. In that case, you have to use an == '' condition for empty strings; remember that an empty string is not the same as null. However, if your actual data has columns of date data type and the example code/data with strings is just inaccurate, then note that Spark cannot insert empty strings into a column of date data type.

Your code is correct and will work without any modification if:

  1. your data is a proper date datatype, not a date stored as a string, and
  2. the value is a proper null, not an empty string.

[Screenshot: datatype of the dataframe created from the data provided in the question]

[Screenshot: the == '' condition works on the string column; "DateType can not accept object '' in type"]

CodePudding user response:

I hope this works for you: convert REGISTER_DATE and FORM_DATE to date type and then implement your logic, or convert them to date and use the coalesce function, just like in SQL. I convert REGISTER_DATE and FORM_DATE back to string at the end in case anything downstream depends on them being strings.

import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, when, coalesce
data = [{"REGISTER_DATE": '', "FORM_DATE": '16-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '13-09-2022', "FORM_DATE": '06-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '20-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '18-12-2022', "GENDER": 'Female', "Truth": True}]
# let Spark infer the schema
spark = SparkSession.builder.getOrCreate()
create_df = spark.createDataFrame(data)
print('Before converting to date and handling null')
create_df.printSchema()
create_df.show()
create_df = create_df.withColumn('FORM_DATE', to_date(col('FORM_DATE'), 'dd-MM-yyyy'))
create_df = create_df.withColumn('REGISTER_DATE', to_date(col('REGISTER_DATE'), 'dd-MM-yyyy'))
# Solution 1:
# create_df = create_df.withColumn('REGISTER_DATE',when((col('REGISTER_DATE').isNull()) & 
#     (col('FORM_DATE').isNotNull()) & (col('GENDER')== 'Female'), create_df.FORM_DATE)
#     .otherwise(create_df['REGISTER_DATE']))
# Solution 2:
create_df = create_df.withColumn('REGISTER_DATE', coalesce('REGISTER_DATE', 'FORM_DATE'))
# convert back to string 
create_df = create_df.withColumn('FORM_DATE', col('FORM_DATE').cast('string'))
create_df = create_df.withColumn('REGISTER_DATE', col('REGISTER_DATE').cast('string'))
print('After converting to date and handling null')
create_df.printSchema()
create_df.show()