A date column in a file(In AWS S3) is in "July 28, 2021" Format.Since it is a file it is being treated as String datatype.I am trying to load the data in to RDS(Postgres). The RDS Column is in date datatype.
I am using the below line to convert the string into date but NULLS getting loaded in the date column , rest string/integer columns are getting loaded correctly.
df_S3=df_S3.withColumn('visit_date', to_date(df_S3.visit_date, 'MON DD, YYYY'))
I changed the date from "July 28, 2021" to "28-JUL-2021" in the S3 File and used the below line of code to process the data into RDS -
df_S3=df_S3.withColumn('visit_date', to_date(df_S3.visit_date, 'DD-MMM-YYYY'))
And dates got loaded correctly into RDS.
Could you please advise how to convert/load "July 28, 2021" into a date datatype column using PySpark ?
Thanks.
CodePudding user response:
You may try the following
df_S3=df_S3.withColumn('visit_date', to_date(df_S3.visit_date, 'MMMM d, yyyy'))
Based on the changes communicated in the latest docs:
We use MMMM
to match the full month name, d
to match the day number and yyyy
to match the year.
Full example
df_S3 = spark.createDataFrame([("July 28, 2021",),("March 5, 2013",)],schema="visit_date string")
df_S3.show()
-------------
| visit_date|
-------------
|July 28, 2021|
|March 5, 2013|
-------------
df_S3.withColumn('visit_date2', F.to_date(df_S3.visit_date, 'MMMM d, yyyy')).show()
------------- -----------
| visit_date|visit_date2|
------------- -----------
|July 28, 2021| 2021-07-28|
|March 5, 2013| 2013-03-05|
------------- -----------