Question: The following code fails to replace null date values with 12/31/1900 in a date column. What might I be doing wrong here, and how can I fix the issue?
A dataframe df is loaded from a data file that has a column Order_date with values in MM/dd/yyyy format. I'm importing this data file into a SQL table in an Azure SQL DB that has a corresponding column with the same name, Order_date. The following code successfully imports the data into the SQL table, but the null values in the Order_date column are not replaced with 12/31/1900:
Remarks:
- The Order_date column in the SQL table has to be of datatype date.
- For the Order_date column, when I tried the following, I got an error saying the dataframe column is of StringType while the destination column is of DateType:
  .withColumn("Order_date", F.when(df.Order_date.isNull(), to_date(lit("12/31/1900"), "'MM/dd/yyyy'")).otherwise(df.Order_date))
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
......
df1 = df.withColumn("CustomerID", df.CustomerID.cast(IntegerType())) \
        .withColumn("Order_date", F.to_date(F.col('Order_date'), 'MM/dd/yyyy'))
df2 = df1.fillna({'Order_date': '12/31/1900'})
# load df2 into SQL table
df2.write(.....)
CodePudding user response:
Can you check whether null handling is causing the issue?
In PySpark, if a CSV field is empty, it is treated as null.
Example:
Source CSV data:
Order_date|CustomerID
11-10-2022|89296
12-10-2022|67760
|879798

>>> df = spark.read.csv("/Path to csv/sample1.csv", sep="|", header="true", inferSchema="false")
>>> df.show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|11-10-2022|     89296|
|12-10-2022|     67760|
|      null|    879798|
+----------+----------+

>>> df.filter("Order_date is NULL").show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|      null|    879798|
+----------+----------+
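In this first case the blanks really are nulls, so a plain na.fill before any date conversion is enough to put a placeholder in. A minimal sketch (the fill value 31-12-1900 is an assumption matching the dd-MM-yyyy layout of this sample data; adjust it to your own format):

>>> df_fixed = df.na.fill({'Order_date': '31-12-1900'})  # replaces true nulls only
>>> df_fixed.filter("Order_date is NULL").count()  # no nulls remain
0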
Whereas if "null" is hardcoded in the data, it will be treated as a string:
Source CSV data:
Order_date|CustomerID
11-10-2022|89296
12-10-2022|67760
null|879798
>>> df_test = spark.read.csv("/path to csv/sample1.csv", sep="|", header="true", inferSchema="false")
>>> df_test.show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|11-10-2022|     89296|
|12-10-2022|     67760|
|      null|    879798|
+----------+----------+

>>> df_test.filter("Order_date=='null'").show()
+----------+----------+
|Order_date|CustomerID|
+----------+----------+
|      null|    879798|
+----------+----------+
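In this second case na.fill will not help, because the column holds the literal string "null" rather than a real null; na.replace handles it instead. A minimal sketch (the 31-12-1900 fill value is again an assumption matching this sample's dd-MM-yyyy layout):

>>> df_test_fixed = df_test.na.replace('null', '31-12-1900', 'Order_date')  # swaps the literal string "null"
>>> df_test_fixed.filter("Order_date=='null'").count()  # no "null" strings remain
0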
CodePudding user response:
Consider replacing the null and empty string values in Order_date before performing your to_date() conversion. Your current code fills after the conversion, at which point Order_date is already DateType, and fillna() with a string value silently ignores non-string columns. For example:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Example').getOrCreate()
data = [
{ 'Order_date': '02/28/1997'},
{ 'Order_date': ''},
{ 'Order_date': None}
]
df = spark.createDataFrame(data)
df.show()
# +----------+
# |Order_date|
# +----------+
# |02/28/1997|
# |          |
# |      null|
# +----------+
df2 = df.withColumn('Order_date2', F.col('Order_date'))
df2 = df2.na.replace('', '01/01/1900', 'Order_date2') # Replaces empty strings
df2 = df2.na.fill({'Order_date2': '01/01/1900'}) # Replaces `null` values
df2.show()
# +----------+-----------+
# |Order_date|Order_date2|
# +----------+-----------+
# |02/28/1997| 02/28/1997|
# |          | 01/01/1900|
# |      null| 01/01/1900|
# +----------+-----------+
df3 = df2.withColumn('Order_date3', F.to_date(df2.Order_date2, 'MM/dd/yyyy'))
df3.show()
# +----------+-----------+-----------+
# |Order_date|Order_date2|Order_date3|
# +----------+-----------+-----------+
# |02/28/1997| 02/28/1997| 1997-02-28|
# |          | 01/01/1900| 1900-01-01|
# |      null| 01/01/1900| 1900-01-01|
# +----------+-----------+-----------+
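Applied to your pipeline, that means doing the fill while Order_date is still a string, using the 12/31/1900 value you wanted, and only then converting. A sketch under the same assumptions as your original code (the CustomerID cast, the column names, and the elided write call are carried over from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
......
df1 = (df
       .withColumn("CustomerID", df.CustomerID.cast(IntegerType()))
       .na.replace('', '12/31/1900', 'Order_date')  # empty strings -> placeholder date string
       .na.fill({'Order_date': '12/31/1900'})       # real nulls -> placeholder date string
       .withColumn("Order_date", F.to_date(F.col('Order_date'), 'MM/dd/yyyy')))
# Order_date is now DateType, with 1900-12-31 wherever the source was null or empty,
# so it can be written into the SQL table's date column without a later fillna.
# load df1 into SQL table
df1.write(.....)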