Home > Back-end >  How to add Date validation in regexp of Spark-Sql
How to add Date validation in regexp of Spark-Sql

Time:10-24

spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)

In the above I need to add the requirements such as,

1.the output should be validated with the format 'yyyymmdd' using unix_timestamp.

  1. if it is not valid then should transform the extracted digits string by moving the first four (4) characters to the end of the extracted digits string (MMDDYYYY to YYYYMMDD) and then should be validated with the 'yyyymmdd' format, if this condition is satisfied then print that date.

I'm not sure how to include the Unix timestamp in my query.

Sample input and output 1:

input: 2021dgsth02hdg02
output: 20210202

Sample input and output 2:

input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101

Using unix_timestamp in place of to_date

spark.sql("select (case when length(regexp_replace(date,'[^0-9]', ''))==8 then CASE WHEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') ,'yyyyMMdd') IS NULL THEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'MMddyyyy') ,'MMddyyyy') ELSE from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') ,'yyyyMMdd') END else regexp_replace(date,'[^0-9]','') end ) AS dt from input").show(false)

CodePudding user response:

Try Below code.

scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
 ---------------- 
|dt              |
 ---------------- 
|2021dgsth02hdg02|
|0101def20dr21   |
 ---------------- 
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z] ",""))
.withColumn("dt",
    when(
        to_date($"dt","yyyyMMdd").isNull,
        to_date($"dt","MMddyyyy")
    )
    .otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
 ---------- 
|dt        |
 ---------- 
|2021-02-02|
|2021-01-01|
 ---------- 

// Entering paste mode (ctrl-D to finish)

spark.sql("""
    select (
        CASE WHEN to_date(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd') IS NULL
             THEN to_date(regexp_replace(date,'[a-zA-Z] ',''),'MMddyyyy')
             ELSE to_date(regexp_replace(date,'[a-zA-Z] ',''),'yyyyMMdd')
        END
    ) AS dt from input
""")
.show(false)

// Exiting paste mode, now interpreting.
 ---------- 
|dt        |
 ---------- 
|2021-02-02|
|2021-01-01|
 ---------- 
  • Related