pyspark - fill null date values with an old date


Question: The following code fails to replace null date values with 12/31/1900 in a date column. What might I be doing wrong here, and how can I fix the issue?

The dataframe df is loaded from a data file that has a column Order_date with values in MM/dd/yyyy format. I'm importing this data file into a SQL table in an Azure SQL DB that has a corresponding column with the same name, Order_date. The following code successfully imports the data into the SQL table, but the null values in the Order_date column are not replaced with 12/31/1900:

Remarks:

  1. The Order_date column in the SQL table has to be of datatype date.
  2. For the Order_date column, when I tried the following, I got an error saying the dataframe column is of StringType while the destination column is of DateType (a type-consistent variant is sketched after this list):
     .withColumn("Order_date", F.when(df.Order_date.isNull(), to_date(lit("12/31/1900"), "'MM/dd/yyyy'")).otherwise(df.Order_date))
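
A sketch of that expression with both branches converted to DateType (my reconstruction, not tested against the actual data): mixing a date and a string in when/otherwise makes Spark resolve the whole column to string, which matches the error above, so to_date() has to be applied to both branches. The stray single quotes inside the original format string ("'MM/dd/yyyy'") would also break the parse.

from pyspark.sql import functions as F

# Convert both branches to DateType so when/otherwise does not
# fall back to StringType (sketch; column names as in the question):
df1 = df.withColumn(
    "Order_date",
    F.when(
        F.col("Order_date").isNull(),
        F.to_date(F.lit("12/31/1900"), "MM/dd/yyyy"),  # plain format string, no inner quotes
    ).otherwise(F.to_date(F.col("Order_date"), "MM/dd/yyyy")),
)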


from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
......
df1 = df.withColumn("CustomerID", df.CustomerID.cast(IntegerType())) \
        .withColumn("Order_date", F.to_date(F.col('Order_date'), 'MM/dd/yyyy'))

df2 = df1.fillna({'Order_date': '12/31/1900'})

# load df2 into SQL table
df2.write(.....)
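
Worth noting: fillna() applies a string replacement value only to string-typed columns, so once Order_date has been converted to DateType by to_date(), the fill on df1 is silently skipped and the nulls survive. A minimal sketch of the same fill expressed with coalesce() instead (assuming the df1 from above):

from pyspark.sql import functions as F

# coalesce() keeps the existing date when present and falls back to a
# DateType literal for nulls, which fillna() cannot do on a date column:
df2 = df1.withColumn(
    "Order_date",
    F.coalesce(F.col("Order_date"),
               F.to_date(F.lit("12/31/1900"), "MM/dd/yyyy")),
)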

CodePudding user response:

Can you check whether the null is what's causing the issue?

In PySpark, if a field in the source file is empty, it will be treated as null.

Example:

Source CSV data:

Order_date|CustomerID
11-10-2022|89296
12-10-2022|67760
|879798

>>> df = spark.read.csv("/Path to csv/sample1.csv", sep="|", header="true", inferSchema="false")

>>> df.show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|11-10-2022|     89296|
|12-10-2022|     67760|
|      null|    879798|
+----------+----------+

>>> df.filter("Order_date is NULL").show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|      null|    879798|
+----------+----------+

Whereas if a hardcoded null appears in the data, it will be treated as the string 'null':

Source CSV data:

Order_date|CustomerID
11-10-2022|89296
12-10-2022|67760
null|879798

>>> df_test = spark.read.csv("/path to csv/sample1.csv", sep="|", header="true", inferSchema="false")
>>> df_test.show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|11-10-2022|     89296|
|12-10-2022|     67760|
|      null|    879798|
+----------+----------+


>>> df_test.filter("Order_date=='null'").show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|      null|    879798|
+----------+----------+
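
If both flavors can occur in the same file, one way to normalize them before any fill logic is to map the literal 'null' token and the empty string to a real null (a sketch, using the df_test from above):

from pyspark.sql import functions as F

# Treat the hardcoded 'null' token and empty strings as real nulls
# so later fill/replace logic only has one case to handle:
df_clean = df_test.withColumn(
    "Order_date",
    F.when(F.col("Order_date").isin("null", ""), None)
     .otherwise(F.col("Order_date")),
)

Alternatively, the nullValue option of spark.read.csv (e.g. nullValue="null") maps the hardcoded token to a real null at read time.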

CodePudding user response:

Consider replacing the null and empty string values in Order_date before performing your to_date() conversion, e.g.:

from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Example').getOrCreate()
data = [
    { 'Order_date': '02/28/1997'},
    { 'Order_date': ''},
    { 'Order_date': None}
]
df = spark.createDataFrame(data)
df.show()
# +----------+
# |Order_date|
# +----------+
# |02/28/1997|
# |          |
# |      null|
# +----------+

df2 = df.withColumn('Order_date2', F.col('Order_date'))
df2 = df2.na.replace('', '01/01/1900', 'Order_date2') # Replaces empty strings
df2 = df2.na.fill({'Order_date2': '01/01/1900'}) # Replaces `null` values
df2.show()
# +----------+-----------+
# |Order_date|Order_date2|
# +----------+-----------+
# |02/28/1997| 02/28/1997|
# |          | 01/01/1900|
# |      null| 01/01/1900|
# +----------+-----------+

df3 = df2.withColumn('Order_date3', F.to_date(df2.Order_date2, 'MM/dd/yyyy'))
df3.show()
# +----------+-----------+-----------+
# |Order_date|Order_date2|Order_date3|
# +----------+-----------+-----------+
# |02/28/1997| 02/28/1997| 1997-02-28|
# |          | 01/01/1900| 1900-01-01|
# |      null| 01/01/1900| 1900-01-01|
# +----------+-----------+-----------+
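
As a final check before the write, it may be worth confirming the converted column really is DateType and renaming it to the name the SQL table expects (a sketch continuing the example above):

# Keep only the converted column, under the name the destination
# table uses, and verify its type:
df_final = df3.select(F.col('Order_date3').alias('Order_date'))
df_final.printSchema()
# root
#  |-- Order_date: date (nullable = true)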
