spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)
In the above I need to add the requirements such as,
1.the output should be validated with the format 'yyyymmdd
' using unix_timestamp.
- if it is not valid then should transform the extracted digits string by moving the first four (4) characters to the end of the extracted digits string (
MMDDYYYY to YYYYMMDD
) and then should be validated with the 'yyyymmdd
' format, if this condition is satisfied then print that date.
I'm not sure how to include the Unix timestamp in my query.
Sample input and output 1:
input: 2021dgsth02hdg02
output: 20210202
Sample input and output 2:
input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101
Using unix_timestamp in place of to_date
spark.sql("select (case when length(regexp_replace(date,'[^0-9]', ''))==8 then CASE WHEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') ,'yyyyMMdd') IS NULL THEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'MMddyyyy') ,'MMddyyyy') ELSE from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') ,'yyyyMMdd') END else regexp_replace(date,'[^0-9]','') end ) AS dt from input").show(false)
CodePudding user response:
Try Below code.
scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
----------------
|dt |
----------------
|2021dgsth02hdg02|
|0101def20dr21 |
----------------
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z] ",""))
.withColumn("dt",
when(
to_date($"dt","yyyyMMdd").isNull,
to_date($"dt","MMddyyyy")
)
.otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
----------
|dt |
----------
|2021-02-02|
|2021-01-01|
----------
// Entering paste mode (ctrl-D to finish)
spark.sql("""
select (
CASE WHEN to_date(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') IS NULL
THEN to_date(regexp_replace(date,'[a-zA-Z] ',''),'MMddyyyy')
ELSE to_date(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd')
END
) AS dt from input
""")
.show(false)
// Exiting paste mode, now interpreting.
----------
|dt |
----------
|2021-02-02|
|2021-01-01|
----------