Converting column data type from string to date with PySpark returns null values

Time:07-27

I am trying to change the data type of a column (Disponibility) from string to date, but every time the converted column contains only null values (for example, 23/01/2022 becomes null).

This is my code:

dfwdate = dfworkers2.withColumn("Disponibility", to_date("Disponibility")) \
.show(truncate=False)   

CodePudding user response:

You also need to pass the date format to to_date. The supported format patterns are listed on the Spark date pattern documentation page.

Date Conversion Examples

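from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

# `sql` is the session object used below; a SparkSession is assumed here
sql = SparkSession.builder.getOrCreate()

s = StringIO("""
date_str
2022-03-01
2022-05-20
2022-06-21
2022-10-22
""")

df = pd.read_csv(s,delimiter=',')

# Parse the string column with an explicit pattern, then drop the original column
sparkDF = sql.createDataFrame(df)\
             .withColumn('date_parsed',F.to_date(F.col('date_str'), 'yyyy-MM-dd'))\
             .drop('date_str')

sparkDF.show()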

+-----------+
|date_parsed|
+-----------+
| 2022-03-01|
| 2022-05-20|
| 2022-06-21|
| 2022-10-22|
+-----------+

sparkDF.printSchema()

root
 |-- date_parsed: date (nullable = true)

CodePudding user response:

to_date('Disponibility', 'dd/MM/yyyy')

You have chosen the right function, to_date. In your case it just needs a second parameter: the actual format of your date string, 'dd/MM/yyyy'. By default, to_date expects the format 'yyyy-MM-dd'. Since your column is not in this format, null is returned.

Full example:

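from pyspark.sql import SparkSession, functions as F

# `spark` already exists in the pyspark shell and most notebooks
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('23/01/2022',)], ['Disponibility'])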

df.show()
# +-------------+
# |Disponibility|
# +-------------+
# |   23/01/2022|
# +-------------+
print(df.dtypes)
# [('Disponibility', 'string')]

df = df.withColumn('Disponibility', F.to_date('Disponibility', 'dd/MM/yyyy'))

df.show()
# +-------------+
# |Disponibility|
# +-------------+
# |   2022-01-23|
# +-------------+
print(df.dtypes)
# [('Disponibility', 'date')]