Issue with date and inferSchema option in spark 3.1


I have a CSV file with a date column as shown below,

datecol
----------
2021-01-11
2021-02-15
2021-02-10
2021-04-22
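
I'm reading the file roughly like this (a minimal sketch; the path and the header option are just placeholders for illustration):

val df = spark.read
  .option("header", "true")       // the file has a "datecol" header row
  .option("inferSchema", "true")
  .csv("/path/to/dates.csv")      // placeholder path
df.printSchema()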

If I read this file with inferSchema enabled in Spark version 2.4.5, I get the schema below:

root
 |-- datecol: timestamp (nullable = true)

But in Spark 3.1, the output is:

root
 |-- datecol: string (nullable = true)

I have checked the migration guide in the Spark documentation but didn't find any information about this.

Could anyone please confirm whether this is a bug, or do I need to use some other configuration?

CodePudding user response:

This is an effect of Spark's migration to the new Java 8 date/time API as of Spark 3.0. From the migration guide:

Parsing/formatting of timestamp/date strings. This effects on CSV/JSON datasources [...]. New implementation performs strict checking of its input. For example, the 2015-07-22 10:00:00 timestamp cannot be parsed if pattern is yyyy-MM-dd because the parser does not consume whole input. Another example is the 31/01/2015 00:00 input cannot be parsed by the dd/MM/yyyy hh:mm pattern because hh supposes hours in the range 1-12. In Spark version 2.4 and below, java.text.SimpleDateFormat is used for timestamp/date string conversions [...].
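
A small sketch of the stricter parsing described in that excerpt (it assumes a local SparkSession; the parser-policy values shown are the standard Spark 3 settings):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The example string from the migration guide excerpt above.
val strict = Seq("2015-07-22 10:00:00").toDF("ts")

// Pattern "yyyy-MM-dd" does not consume the whole input, so the new parser
// rejects it. Under the default spark.sql.legacy.timeParserPolicy=EXCEPTION
// Spark raises an upgrade error; with CORRECTED the value becomes null;
// with LEGACY it falls back to the Spark 2.4 SimpleDateFormat behaviour.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
strict.select(to_timestamp(col("ts"), "yyyy-MM-dd").as("parsed")).show(false)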

In fact, inferSchema does not detect DateType, only TimestampType. And since the default value of the timestampFormat parameter in the CSV data source is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX], the column is not converted into a timestamp, for the reason cited above.

You can try adding the option when loading the CSV:

val df = spark.read
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd")
  .csv("/path/csv")
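
If a DateType is preferred over a timestamp, one option is to cast afterwards (a follow-up sketch building on the read above):

import org.apache.spark.sql.functions.col

// Cast the inferred timestamp column down to a date; "datecol" matches
// the header from the question.
val withDate = df.withColumn("datecol", col("datecol").cast("date"))
withDate.printSchema()
// root
//  |-- datecol: date (nullable = true)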