Processing Strings into Date Datatype column in RDS using PySpark-CodePudding

A date column in a file(In AWS S3) is in "July 28, 2021" Format.Since it is a file it is being treated as String datatype.I am trying to load the data in to RDS(Postgres). The RDS Column is in date datatype.

I am using the below line to convert the string into date but NULLS getting loaded in the date column , rest string/integer columns are getting loaded correctly. df_S3=df_S3.withColumn('visit_date', to_date(df_S3.visit_date, 'MON DD, YYYY'))

I changed the date from "July 28, 2021" to "28-JUL-2021" in the S3 File and used the below line of code to process the data into RDS - df_S3=df_S3.withColumn('visit_date', to_date(df_S3.visit_date, 'DD-MMM-YYYY'))

And dates got loaded correctly into RDS.

Could you please advise how to convert/load "July 28, 2021" into a date datatype column using PySpark ?

Thanks.

CodePudding user response：

You may try the following

df_S3=df_S3.withColumn('visit_date', to_date(df_S3.visit_date, 'MMMM d, yyyy'))

Based on the changes communicated in the latest docs:

We use MMMM to match the full month name, d to match the day number and yyyy to match the year.

Full example

df_S3 = spark.createDataFrame([("July 28, 2021",),("March 5, 2013",)],schema="visit_date string")
df_S3.show()

 ------------- 
|   visit_date|
 ------------- 
|July 28, 2021|
|March 5, 2013|
 -------------

df_S3.withColumn('visit_date2', F.to_date(df_S3.visit_date, 'MMMM d, yyyy')).show()

 ------------- ----------- 
|   visit_date|visit_date2|
 ------------- ----------- 
|July 28, 2021| 2021-07-28|
|March 5, 2013| 2013-03-05|
 ------------- -----------