I have a CSV file which I read in with Scala and Spark. The data contains a time column whose values are strings of the form
val myTimestamp = "2021-05-24 18:44:22.127631600+02:00"
I now need to parse this timestamp. Since I am working with a DataFrame, I want to use .withColumn and to_timestamp.
Sample code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val myTimestamp: String = "2021-05-24 18:44:22.127631600+02:00"
val myFormat: String = "yyyy-MM-dd HH:mm:ss"

Seq(myTimestamp)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), myFormat))
  .show()
Output:
+--------------------+-------------------+
|  theTimestampColumn|         parsedTime|
+--------------------+-------------------+
|2021-05-24 18:44:...|2021-05-24 18:44:22|
+--------------------+-------------------+
Running this code works fine, but it restricts my timestamps to second precision. I want to keep the whole precision with 9 fractional digits of the second. Therefore I read the documentation at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html , but I wasn't able to find the right number of S's (tried 1 to 9) and X's for specifying the fractions of the second and the time zone, respectively. The parsedTime column of the DataFrame becomes null. How do I parse this timestamp given the tools above?
For example, I have also tried
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSZ"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSX"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXXXXX"
with the original timestamp, or
val myTimestamp: String = "2021-05-24 18:44:22.127631600"
val myFormat: String = "yyyy-MM-dd HH:mm:ss.SSSSSSSSS"
but the conversion still yields a null value.
Update: I just read that the fmt argument is optional. Leaving it out and calling to_timestamp(col("theTimestampColumn")) parses the timestamp automatically, but only to 6 fractional digits.
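(For context on that 6-digit limit: Spark's TimestampType stores microseconds internally. A standalone java.time sketch, not Spark itself, illustrating the same truncation on the timestamp from the question:)

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Nine fractional digits in the input string...
val fmt  = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")
val full = LocalDateTime.parse("2021-05-24 18:44:22.127631600", fmt)

// ...but Spark's TimestampType keeps only microseconds, i.e. the
// same effect as truncating the parsed value to MICROS:
val micros = full.truncatedTo(ChronoUnit.MICROS)

println(full.getNano)   // 127631600
println(micros.getNano) // 127631000
```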
CodePudding user response:
If your zone offset has a colon, your format pattern should have 3 X's or x's, depending on whether your format uses Z or +00:00 for zero offset. Or 5 X's or Z's to include optional seconds.
From the documentation:
Offset X and x: [...] Three letters outputs the hour and minute, with a colon, such as ‘+01:30’ [...] Five letters outputs the hour and minute and optional second, with a colon, such as ‘+01:30:15’ [...] Pattern letter ‘X’ (upper case) will output ‘Z’ when the offset to be output would be zero, whereas pattern letter ‘x’ (lower case) will output ‘+00’, ‘+0000’, or ‘+00:00’.
Offset Z: [...] Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs ‘Z’ if the offset is zero. [...]
So probably one of these should work for you:
val formatA = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSxxx"
val formatB = "yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX"
Note that the docs also say
Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
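Since Spark 3.x datetime patterns follow java.time semantics, the suggested pattern can be sanity-checked outside Spark with a plain DateTimeFormatter (a standalone sketch, assuming the offset in the data is written as +02:00):

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Same pattern as formatB above: nine fractional digits
// plus a colon-separated zone offset (XXX)
val fmt    = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSSXXX")
val parsed = OffsetDateTime.parse("2021-05-24 18:44:22.127631600+02:00", fmt)

println(parsed.getNano)   // 127631600 -- full nanosecond precision survives here
println(parsed.getOffset) // +02:00
```

Inside Spark, the same pattern parses successfully, but the stored value is truncated to microseconds as the quoted docs describe.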