Home > database >  DateType column read as StringType from CSV file even when appropriate schema provided
DateType column read as StringType from CSV file even when appropriate schema provided

Time:06-27

I am trying to read a CSV file using PySpark containing a DateType field in the format "dd/MM/yyyy". I have specified the field as DateType() in schema definition and also provided the option "dateFormat" in DataFrame CSV reader. However, the output dataframe after read is having the field as StringType() instead of DateType().

Sample input data:

"school_id","gender","class","doj"
"1","M","9","01/01/2020" 
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"

My code:

from pyspark.sql.types import StructField, StructType, StringType, DateType
school_students_schema = StructType([StructField("school_id", StringType(),True) ,\
                             StructField("gender", StringType(),True) ,\
                             StructField("class", StringType(),True) ,\
                             StructField("doj", DateType(),True)    
                            ])

school_students_df = spark.read.format("csv") \
                           .option("header", True) \
                           .option("schema", school_students_schema) \
                           .option("dateFormat", "dd/MM/yyyy") \
                           .load("/user/test/school_students.csv")
school_students_df.printSchema()

Actual output after running the above (column doj parsed as string instead of the specified DateType and dateFormat without any exception).

root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)

Expected output:

root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)

Runtime environment

Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)

Requesting your help to understand:

  1. Why is the column being parsed as StringType even though DateType is mentioned in schema?
  2. What needs to be done in the code so that the column doj is parsed as DateType()?

CodePudding user response:

Need

.option("dateFormat", "some format") 

or the appropriate default format. Becomes stringtype if not correct format.

Only 1 date format possible this way btw. Otherwise in line manipulation.

CodePudding user response:

You should use

.schema(school_students_schema)

instead of

.option("schema", school_students_schema)

(There is no "schema" in the available option list.)

  • Related