I have a CSV file with a date column, as shown below:
datecol
----------
2021-01-11
2021-02-15
2021-02-10
2021-04-22
If I read this file with inferSchema enabled in Spark version 2.4.5, I get the schema below:
root
|-- datecol: timestamp (nullable = true)
But in Spark 3.1 the output is:
root
|-- datecol: string (nullable = true)
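For reference, this is roughly how I'm reading the file in both versions (the path is just a placeholder):
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/data.csv") // placeholder path
df.printSchema()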
I have checked the migration guide in the Spark documentation but didn't find any information about this.
Could anyone please confirm whether this is a bug, or do I need to use some other configuration?
CodePudding user response:
This is an effect of Spark's migration to the new Java 8 date/time API in Spark 3. You can read in the migration guide:
Parsing/formatting of timestamp/date strings. This effects on CSV/JSON datasources [...]. New implementation performs strict checking of its input. For example, the 2015-07-22 10:00:00 timestamp cannot be parsed if pattern is yyyy-MM-dd because the parser does not consume whole input. Another example is the 31/01/2015 00:00 input cannot be parsed by the dd/MM/yyyy hh:mm pattern because hh supposes hours in the range 1-12. In Spark version 2.4 and below, java.text.SimpleDateFormat is used for timestamp/date string conversions [...].
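You can see the strict checking in action with a small sketch (the input string and pattern are taken from the quoted example):
import org.apache.spark.sql.functions._
import spark.implicits._

val strict = Seq("2015-07-22 10:00:00").toDF("ts")
// The new parser must consume the whole input; yyyy-MM-dd leaves
// " 10:00:00" unconsumed, so in Spark 3 this parsing fails (throwing a
// SparkUpgradeException or yielding null, depending on
// spark.sql.legacy.timeParserPolicy), whereas Spark 2.4's
// SimpleDateFormat parsed it leniently.
strict.select(to_timestamp($"ts", "yyyy-MM-dd")).show()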
In fact, inferSchema does not detect DateType, only TimestampType. And since the default value of the CSV data source option timestampFormat is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX], the date strings are not converted into timestamps, for the reason cited above.
You can try adding the timestampFormat option when loading the CSV:
val df = spark.read
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd")
  .csv("/path/csv")
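You can verify with df.printSchema() that datecol is then inferred as timestamp again. Alternatively, setting spark.sql.legacy.timeParserPolicy to LEGACY restores the Spark 2.4 date/time parsing behavior.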