I have a CSV file which I read in with Scala and Spark. The data contains a time column whose values are strings of the form
val myTimestamp = "2021-05-24 18:44:22.127631600+02:00"
I now need to parse this timestamp. Since I am working with a DataFrame, I want to use .withColumn and to_timestamp.
Sample code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val myTimestamp: String = "2021-05-24 18:44:22.127631600+02:00"
val myFormat: String = "yyyy-MM-dd HH:mm:ss"

Seq(myTimestamp)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), myFormat))
  .show()
Output:
+--------------------+-------------------+
|  theTimestampColumn|         parsedTime|
+--------------------+-------------------+
|2021-05-24 18:44:...|2021-05-24 18:44:22|
+--------------------+-------------------+
Running this code works fine, but it restricts my timestamps to second precision. I want to keep the whole precision with 9 fractional digits of the second. Therefore I read the documentation at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html , but I wasn't able to find the right number of S's (tried 1 to 9) and X's for specifying the fractions of the second and the time zone, respectively. The parsedTime column of the DataFrame becomes null. How do I parse this timestamp given the tools above?
For example, I have also tried
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSZ"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSX"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXXXXX"
with the original timestamp, or
val myTimestamp: String = "2021-05-24 18:44:22.127631600"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSS"
but the conversion still yields a null value.
Update: I just read that the fmt argument is optional. Leaving it out and calling to_timestamp(col("theTimestampColumn")) parses the timestamp automatically, but only to 6 fractional digits.
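(For context on that 6-digit limit: Spark's TimestampType stores microseconds internally. A standalone java.time sketch, not Spark itself, illustrating the same truncation on the timestamp from the question:)

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Nine fractional digits in the input string...
val fmt  = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")
val full = LocalDateTime.parse("2021-05-24 18:44:22.127631600", fmt)

// ...but Spark's TimestampType keeps only microseconds, i.e. the
// same effect as truncating the parsed value to MICROS:
val micros = full.truncatedTo(ChronoUnit.MICROS)

println(full.getNano)   // 127631600
println(micros.getNano) // 127631000
```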
CodePudding user response:
If your zone offset has a colon, your format pattern should have 3 X's or x's, depending on whether your format uses Z or +00:00 for zero offset. Or 5 X's or Z's to include optional seconds.
From the documentation:
Offset X and x: [...] Three letters outputs the hour and minute, with a colon, such as ‘+01:30’ [...] Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’ [...] Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: [...] Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. [...]
So probably one of these should work for you:
val formatA = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSxxx"
val formatB = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX"
Note that the docs also say
Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
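Since Spark 3.x datetime patterns follow java.time semantics, the suggested pattern can be sanity-checked outside Spark with a plain DateTimeFormatter (a standalone sketch, assuming the offset in the data is written as +02:00):

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Same pattern as formatB above: nine fractional digits
// plus a colon-separated zone offset (XXX)
val fmt    = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX")
val parsed = OffsetDateTime.parse("2021-05-24 18:44:22.127631600+02:00", fmt)

println(parsed.getNano)   // 127631600 -- full nanosecond precision survives here
println(parsed.getOffset) // +02:00
```

Inside Spark, the same pattern parses successfully, but the stored value is truncated to microseconds as the quoted docs describe.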