I'm trying to convert a String to a timestamp, but in my time zone the last Sunday of March between 2:00 AM and 3:00 AM does not exist (the clocks jump straight from 2:00 to 3:00 for DST), so the conversion returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
|                                              null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
|                                2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A solution might be to add one hour to the values that fall between 2:00 AM and 3:00 AM on those days, but I don't know how to implement it.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp,to_unix_timestamp, to_timestamp and to_date will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they result NULL. (1)
CodePudding user response:
Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically adds one hour in your situation.
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
|             ts|               time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
|             ts|               time|
+---------------+-------------------+
|20220327 021500|               null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp works the same way as to_timestamp does in later versions. The only problem is that we cannot use a custom date format. Yet, if we convert the date ourselves, we can obtain this:
df.withColumn("ts", concat(
substring('ts, 0, 4), lit("-"),
substring('ts, 5, 2), lit("-"),
substring('ts, 7, 5), lit(":"),
substring('ts,12,2), lit(":"),
substring('ts,14,2))
)
.withColumn("time", to_utc_timestamp('ts, "UTC"))
.show
+-------------------+-------------------+
|                 ts|               time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
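If you prefer to stay in SQL, as in the question, the same trick can be expressed with substr, concat and to_utc_timestamp. This is only a sketch of the idea above, not tested code: the temporary view name t is made up, the string column ts is assumed to be in the question's yyyyMMdd HHmmss format, and it relies on the same 2.4.7 behaviour of to_utc_timestamp observed above.
df.createOrReplaceTempView("t")
spark.sql("""
  SELECT ts,
         to_utc_timestamp(
           concat(substr(ts, 1, 4), '-', substr(ts, 5, 2), '-', substr(ts, 7, 2), ' ',
                  substr(ts, 10, 2), ':', substr(ts, 12, 2), ':', substr(ts, 14, 2)),
           'UTC') AS time  -- rebuild yyyy-MM-dd HH:mm:ss, then let to_utc_timestamp shift the DST gap
  FROM t
""").show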
CodePudding user response:
I found two different solutions; in both of them you have to change the String format to yyyy-MM-dd HH:mm:ss:
select cast(CONCAT(SUBSTR('20220327021000', 0, 4), '-',
                   SUBSTR('20220327021000', 5, 2), '-',
                   SUBSTR('20220327021000', 7, 2), ' ',
                   SUBSTR('20220327021000', 9, 2), ':',
                   SUBSTR('20220327021000', 11, 2), ':',
                   SUBSTR('20220327021000', 13, 2)) as timestamp);
select cast(
  regexp_replace("20220327031005",
    '^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$',
    '$1-$2-$3 $4:$5:$6'
  ) as timestamp);
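For completeness, here is a sketch of how the regexp_replace variant might be applied to a column instead of a literal. The view name t and the column name ts are made up, and the optional space in the pattern is an assumption of mine (the question's strings contain a space between date and time, while the literals above do not):
spark.sql("""
  SELECT ts,
         cast(regexp_replace(ts,
                '^(\\d{4})(\\d{2})(\\d{2}) ?(\\d{2})(\\d{2})(\\d{2})$',
                '$1-$2-$3 $4:$5:$6') as timestamp) AS time  -- reshape the string, then cast it
  FROM t
""").show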