Question: The following code fails to replace null date values with 12/31/1900 in a date column. What might I be doing wrong here, and how can I fix the issue?
A dataframe df is loaded from a data file that has a column Order_date with values in MM/dd/yyyy format. I'm importing this data file into a SQL table in an Azure SQL DB that has a corresponding column with the same name, Order_date. The following code successfully imports the data into the SQL table, but the null values in the Order_date column are not replaced with 12/31/1900:
Remarks:
- The Order_date column in the SQL table has to be of datatype date.
- For the Order_date column, when I tried the following, I got an error saying the dataframe column is of StringType while the destination column is of DateType:
  .withColumn("Order_date", F.when(df.Order_date.isNull(), to_date(lit("12/31/1900"), "'MM/dd/yyyy'")).otherwise(df.Order_date))
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
......
df1 = df.withColumn("CustomerID", df.CustomerID.cast(IntegerType())) \
        .withColumn("Order_date", F.to_date(F.col('Order_date'), 'MM/dd/yyyy'))
df2 = df1.fillna({'Order_date': '12/31/1900'})
# load df2 into SQL table
df2.write(.....)
CodePudding user response:
Can you check whether null handling is causing the issue?
In PySpark, if a CSV field is empty, it is treated as null.
Example:
Source CSV data:
Order_date|CustomerID
11-10-2022|89296
12-10-2022|67760
|879798

>>> df = spark.read.csv("/Path to csv/sample1.csv", sep="|", header="true", inferSchema="false")
>>> df.show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|11-10-2022|     89296|
|12-10-2022|     67760|
|      null|    879798|
+----------+----------+

>>> df.filter("Order_date is NULL").show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|      null|    879798|
+----------+----------+
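In this first case the blanks really are nulls, so a plain na.fill before any date conversion is enough to put a placeholder in. A minimal sketch (the fill value 31-12-1900 is an assumption matching the dd-MM-yyyy layout of this sample data; adjust it to your own format):

>>> df_fixed = df.na.fill({'Order_date': '31-12-1900'})  # replaces true nulls only
>>> df_fixed.filter("Order_date is NULL").count()  # no nulls remain
0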
Whereas if "null" is hardcoded in the data, it will be treated as a string:
Source CSV data:
Order_date|CustomerID
11-10-2022|89296
12-10-2022|67760
null|879798
>>> df_test = spark.read.csv("/path to csv/sample1.csv", sep="|", header="true", inferSchema="false")
>>> df_test.show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|11-10-2022|     89296|
|12-10-2022|     67760|
|      null|    879798|
+----------+----------+

>>> df_test.filter("Order_date=='null'").show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|      null|    879798|
+----------+----------+
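In this second case na.fill will not help, because the column holds the literal string "null" rather than a real null; na.replace handles it instead. A minimal sketch (the 31-12-1900 fill value is again an assumption matching this sample's dd-MM-yyyy layout):

>>> df_test_fixed = df_test.na.replace('null', '31-12-1900', 'Order_date')  # swaps the literal string "null"
>>> df_test_fixed.filter("Order_date=='null'").count()  # no "null" strings remain
0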
CodePudding user response:
Consider replacing the null and empty string values in Order_date before performing your to_date() conversion. Your current code fills after the conversion, at which point Order_date is already DateType, and fillna() with a string value silently ignores non-string columns. For example:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Example').getOrCreate()
data = [
{ 'Order_date': '02/28/1997'},
{ 'Order_date': ''},
{ 'Order_date': None}
]
df = spark.createDataFrame(data)
df.show()
# +----------+
# |Order_date|
# +----------+
# |02/28/1997|
# |          |
# |      null|
# +----------+
df2 = df.withColumn('Order_date2', F.col('Order_date'))
df2 = df2.na.replace('', '01/01/1900', 'Order_date2') # Replaces empty strings
df2 = df2.na.fill({'Order_date2': '01/01/1900'}) # Replaces `null` values
df2.show()
# +----------+-----------+
# |Order_date|Order_date2|
# +----------+-----------+
# |02/28/1997| 02/28/1997|
# |          | 01/01/1900|
# |      null| 01/01/1900|
# +----------+-----------+
df3 = df2.withColumn('Order_date3', F.to_date(df2.Order_date2, 'MM/dd/yyyy'))
df3.show()
# +----------+-----------+-----------+
# |Order_date|Order_date2|Order_date3|
# +----------+-----------+-----------+
# |02/28/1997| 02/28/1997| 1997-02-28|
# |          | 01/01/1900| 1900-01-01|
# |      null| 01/01/1900| 1900-01-01|
# +----------+-----------+-----------+
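Applied to your pipeline, that means doing the fill while Order_date is still a string, using the 12/31/1900 value you wanted, and only then converting. A sketch under the same assumptions as your original code (the CustomerID cast, the column names, and the elided write call are carried over from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
......
df1 = (df
       .withColumn("CustomerID", df.CustomerID.cast(IntegerType()))
       .na.replace('', '12/31/1900', 'Order_date')  # empty strings -> placeholder date string
       .na.fill({'Order_date': '12/31/1900'})       # real nulls -> placeholder date string
       .withColumn("Order_date", F.to_date(F.col('Order_date'), 'MM/dd/yyyy')))
# Order_date is now DateType, with 1900-12-31 wherever the source was null or empty,
# so it can be written into the SQL table's date column without a later fillna.
# load df1 into SQL table
df1.write(.....)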