How can we use the isNull() condition with date-type columns in PySpark? I need to update a column named REGISTER_DATE when it is null and other conditions are also satisfied.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col

data = [{"REGISTER_DATE": '', "FORM_DATE": '16-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '13-09-2022', "FORM_DATE": '06-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '20-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '18-12-2022', "GENDER": 'Female', "Truth": True}]

spark = SparkSession.builder.getOrCreate()
create_df = spark.createDataFrame(data)
The FORM_DATE and REGISTER_DATE columns are of date type and GENDER is a string. The query I am trying is:
create_df = create_df.withColumn('REGISTER_DATE',
                                 F.when((col('REGISTER_DATE').isNull()) &
                                        (col('FORM_DATE') != '') &
                                        (col('GENDER') == 'FEMALE'), create_df.FORM_DATE)
                                  .otherwise(create_df['REGISTER_DATE']))
The query runs fine, but it does not update REGISTER_DATE even though the conditions are met in the source file. As far as I know, using isNull() with a date-type column may be invalid. I want to replace null values in the REGISTER_DATE column with the FORM_DATE column value when the condition matches.
CodePudding user response:
Let's say that you have this data:
+-------------+----------+------+-----+
|REGISTER_DATE|FORM_DATE |GENDER|TRUTH|
+-------------+----------+------+-----+
|null         |2022-12-16|Female|True |
|2022-09-13   |2022-12-06|Female|True |
|null         |2022-12-20|Female|True |
|null         |2022-12-18|Female|True |
+-------------+----------+------+-----+
with this schema:
root
|-- REGISTER_DATE: date (nullable = true)
|-- FORM_DATE: date (nullable = true)
|-- GENDER: string (nullable = true) <-- doesn't matter
|-- TRUTH: string (nullable = true) <-- doesn't matter
If your columns (REGISTER_DATE and FORM_DATE) are of date type, they can never hold an empty string; a date value is either a valid date or null.
If we do this:
data = data.withColumn("REGISTER_DATE",
                       when(col("REGISTER_DATE").isNull(), col("FORM_DATE"))
                       .otherwise(col("REGISTER_DATE")))
Then, we will have this as output:
+-------------+----------+------+-----+
|REGISTER_DATE|FORM_DATE |GENDER|TRUTH|
+-------------+----------+------+-----+
|2022-12-16   |2022-12-16|Female|True |
|2022-09-13   |2022-12-06|Female|True |
|2022-12-20   |2022-12-20|Female|True |
|2022-12-18   |2022-12-18|Female|True |
+-------------+----------+------+-----+
In your example, you are filling the data with empty strings while claiming the columns are of date type, which cannot be the case. Your & condition makes the whole statement evaluate to false, most likely because you are comparing a date-type column with an empty string, which never matches.
You just have to make sure that REGISTER_DATE and FORM_DATE are date and not string, and that any casting (if you do it in the future) is correct.
Good luck!
CodePudding user response:
With the data you provided, Spark creates columns of string type. In that case, you have to use the == '' condition for empty strings; remember that an empty string is not the same as null. If, on the other hand, your actual data has date-type columns and the string-based example code is just inaccurate, note that Spark cannot insert empty strings into a date-type column at all.
Your code is correct and will work without any modification if:
- your data is a proper date type, not a date stored as a string, and
- the missing values are proper nulls, not empty strings.
(screenshot: datatypes of the DataFrame created from the data provided in the question)
(screenshot: filtering with == '' works on the string column, while DateType raises "can not accept object '' in type")
CodePudding user response:
I hope this works for you: convert REGISTER_DATE and FORM_DATE to date type and then apply your logic, or convert them to date and use the coalesce function, just like in SQL. I convert REGISTER_DATE and FORM_DATE back to string at the end in case anything downstream depends on the string format.
import findspark
findspark.init()
findspark.find()
from pyspark.sql import *
from pyspark.sql.functions import *
data = [{"REGISTER_DATE": '', "FORM_DATE": '16-12-2022', "GENDER": 'Female', "Truth": True},
{"REGISTER_DATE": '13-09-2022', "FORM_DATE": '06-12-2022',"GENDER": 'Female', "Truth":True},
{"REGISTER_DATE": '', "FORM_DATE": '20-12-2022', "GENDER": 'Female', "Truth": True},
        {"REGISTER_DATE": '', "FORM_DATE": '18-12-2022', "GENDER": 'Female', "Truth": True}]

# let Spark infer the schema (no explicit schema definition)
spark = SparkSession.builder.getOrCreate()
create_df = spark.createDataFrame(data)
print('Before converting to date and handling null')
create_df.printSchema()
create_df.show()
create_df = create_df.withColumn('FORM_DATE', to_date(col('FORM_DATE'), 'dd-MM-yyyy'))
create_df = create_df.withColumn('REGISTER_DATE', to_date(col('REGISTER_DATE'), 'dd-MM-yyyy'))
# Solution 1:
# create_df = create_df.withColumn('REGISTER_DATE',when((col('REGISTER_DATE').isNull()) &
# (col('FORM_DATE').isNotNull()) & (col('GENDER')== 'Female'), create_df.FORM_DATE)
# .otherwise(create_df['REGISTER_DATE']))
# Solution 2:
create_df = create_df.withColumn('REGISTER_DATE', coalesce('REGISTER_DATE', 'FORM_DATE'))
# convert back to string
create_df = create_df.withColumn('FORM_DATE', col('FORM_DATE').cast('string'))
create_df = create_df.withColumn('REGISTER_DATE', col('REGISTER_DATE').cast('string'))
print('After converting to date and handling null')
create_df.printSchema()
create_df.show()