I have a PySpark DataFrame like this:
df = spark.createDataFrame([('7/1/20', )],['DATE'])
df.show()
+------+
|  DATE|
+------+
|7/1/20|
+------+
Note: the date format is (m)m/(d)d/yy, e.g. 1/25/20, 4/5/20, 11/2/20, etc.
I tried the solutions suggested here, but they return an empty result. Example:
df = df.withColumn("DATE_1",F.to_date(F.col("DATE"),"%m/%d/%y"))
df.show()
But it returns null:
+------+------+
|  DATE|DATE_1|
+------+------+
|7/1/20|  null|
+------+------+
How can I parse dates like this?
CodePudding user response:
I'm using pyspark==3.2.1 and needed to set the spark.sql.legacy.timeParserPolicy configuration to LEGACY to use the following solution:
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')
import pyspark.sql.functions as f
df = spark.createDataFrame([
('7/1/20',)
], ['Date'])
df = (
df.withColumn('Date_1', f.to_date(f.col('Date'), 'MM/dd/yy'))
)
df.show(truncate=False)
output:
+------+----------+
|Date  |Date_1    |
+------+----------+
|7/1/20|2020-07-01|
+------+----------+
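Why LEGACY helps here: with that policy Spark falls back to Java's SimpleDateFormat, whose numeric parsing is lenient, so the zero-padded pattern 'MM/dd/yy' still accepts single-digit fields like '7/1/20'. Python's datetime.strptime is lenient in the same way, which makes for a quick sanity check of the format outside Spark (plain Python, not PySpark):

```python
from datetime import datetime

# Like the legacy SimpleDateFormat parser, strptime accepts
# single-digit months/days even with zero-padded directives.
parsed = datetime.strptime('7/1/20', '%m/%d/%y')
print(parsed.date())  # -> 2020-07-01
```

This also shows why the original attempt with "%m/%d/%y" failed: Spark's to_date expects Java-style patterns ('MM/dd/yy'), not Python strftime directives.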
And if you don't want to set spark.sql.legacy.timeParserPolicy to LEGACY, you can use this solution instead:
import pyspark.sql.functions as f
df = spark.createDataFrame([
('7/1/20',),
('10/1/20',),
('7/10/20',),
('10/10/20',)
], ['Date'])
df = (
df
.withColumn('Date_Converted', f.regexp_replace(f.col('Date'), '^([0-9]{1}/)', '0$1'))
.withColumn('Date_Converted', f.regexp_replace(f.col('Date_Converted'), '/([0-9]{1}/)', '/0$1'))
.withColumn('Date_1', f.to_date(f.col('Date_Converted'), 'MM/dd/yy'))
)
df.show(truncate=False)
output:
+--------+--------------+----------+
|Date    |Date_Converted|Date_1    |
+--------+--------------+----------+
|7/1/20  |07/01/20      |2020-07-01|
|10/1/20 |10/01/20      |2020-10-01|
|7/10/20 |07/10/20      |2020-07-10|
|10/10/20|10/10/20      |2020-10-10|
+--------+--------------+----------+
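The two regexp_replace calls simply zero-pad any single-digit month or day before parsing with the strict 'MM/dd/yy' pattern. The same transformation can be sketched in plain Python with re.sub as a stand-in for Spark's regexp_replace (note Python uses \1 for the backreference where Spark uses $1):

```python
import re

def zero_pad(date_str):
    # Pad a single-digit month at the start of the string: "7/..." -> "07/..."
    date_str = re.sub(r'^([0-9])/', r'0\1/', date_str)
    # Pad a single-digit day between the slashes: "/1/" -> "/01/"
    date_str = re.sub(r'/([0-9])/', r'/0\1/', date_str)
    return date_str

for s in ['7/1/20', '10/1/20', '7/10/20', '10/10/20']:
    print(s, '->', zero_pad(s))
# 7/1/20 -> 07/01/20
# 10/1/20 -> 10/01/20
# 7/10/20 -> 07/10/20
# 10/10/20 -> 10/10/20
```

As a possible shortcut worth testing on your Spark version: the non-legacy Spark 3 parser uses Java DateTimeFormatter patterns, where single-letter fields like 'M/d/yy' can match one- or two-digit values, which may avoid the padding step entirely.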