I'm trying to convert a String to a timestamp, but in my time zone the last Sunday of March between 2:00 AM and 3:00 AM does not exist (the clocks jump straight from 2:00 to 3:00 for DST), so the conversion returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
|                                              null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
|                                2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A solution might be to add one hour to the values that fall between 2:00 AM and 3:00 AM on those days, but I don't know how to implement it.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp,to_unix_timestamp, to_timestamp and to_date will fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they result NULL. (1)
CodePudding user response:
Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically adds one hour in your situation.
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
|             ts|               time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
|             ts|               time|
+---------------+-------------------+
|20220327 021500|               null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp works the same way as to_timestamp does in later versions. The only problem is that we cannot use a custom date format. Yet, if we convert the date ourselves, we can obtain this:
df.withColumn("ts", concat(
substring('ts, 0, 4), lit("-"),
substring('ts, 5, 2), lit("-"),
substring('ts, 7, 5), lit(":"),
substring('ts,12,2), lit(":"),
substring('ts,14,2))
)
.withColumn("time", to_utc_timestamp('ts, "UTC"))
.show
+-------------------+-------------------+
|                 ts|               time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
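If you prefer to stay in SQL, as in the question, the same trick can be expressed with substr, concat and to_utc_timestamp. This is only a sketch of the idea above, not tested code: the temporary view name t is made up, the string column ts is assumed to be in the question's yyyyMMdd HHmmss format, and it relies on the same 2.4.7 behaviour of to_utc_timestamp observed above.
df.createOrReplaceTempView("t")
spark.sql("""
  SELECT ts,
         to_utc_timestamp(
           concat(substr(ts, 1, 4), '-', substr(ts, 5, 2), '-', substr(ts, 7, 2), ' ',
                  substr(ts, 10, 2), ':', substr(ts, 12, 2), ':', substr(ts, 14, 2)),
           'UTC') AS time  -- rebuild yyyy-MM-dd HH:mm:ss, then let to_utc_timestamp shift the DST gap
  FROM t
""").show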
CodePudding user response:
I found two different solutions; in both of them you have to change the String format to yyyy-MM-dd HH:mm:ss:
select cast(CONCAT(SUBSTR('20220327021000', 0, 4), '-',
                   SUBSTR('20220327021000', 5, 2), '-',
                   SUBSTR('20220327021000', 7, 2), ' ',
                   SUBSTR('20220327021000', 9, 2), ':',
                   SUBSTR('20220327021000', 11, 2), ':',
                   SUBSTR('20220327021000', 13, 2)) as timestamp);
select cast(
  regexp_replace("20220327031005",
    '^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$',
    '$1-$2-$3 $4:$5:$6'
  ) as timestamp);
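For completeness, here is a sketch of how the regexp_replace variant might be applied to a column instead of a literal. The view name t and the column name ts are made up, and the optional space in the pattern is an assumption of mine (the question's strings contain a space between date and time, while the literals above do not):
spark.sql("""
  SELECT ts,
         cast(regexp_replace(ts,
                '^(\\d{4})(\\d{2})(\\d{2}) ?(\\d{2})(\\d{2})(\\d{2})$',
                '$1-$2-$3 $4:$5:$6') as timestamp) AS time  -- reshape the string, then cast it
  FROM t
""").show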