Home > database >  Spark DDL Schema JSON Struct
Spark DDL Schema JSON Struct

Time:02-28

Question

I am trying to define a nested .json schema in pyspark, but cannot get the ddl_schema string to work.

Usually in SQL this would be ROW, I have tried STRUCT below but can't get the data type correct this is the error...

ParseException: 
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)

== SQL ==

    driverId INT,
    driverRef STRING,
    number STRING,
    code STRING,
    name STRUCT(forename STRING, surname STRING),
---------------^^^
    dob DATE,
    nationality STRING,
    url STRING

Data Sample

             -------- ---------- ------ ---- -------------------- ---------- ----------- -------------------- 
            |driverId| driverRef|number|code|                name|       dob|nationality|                 url|
             -------- ---------- ------ ---- -------------------- ---------- ----------- -------------------- 
            |       1|  hamilton|    44| HAM|   {Lewis, Hamilton}|1985-01-07|    British|http://en.wikiped...|

Code Sample

        mnt = "/mnt/dev/root"
        env = "raw"
        path = "formula1/drivers"
        fileFormat = "json"
        
        inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
        
        
        options = {'header': 'True'}
        
        ddl_schema = """
            driverId INT,
            driverRef STRING,
            number STRING,
            code STRING,
            name STRUCT(forename STRING, surname STRING),
            dob DATE,
            nationality STRING,
            url STRING
        """
        
        drivers_df = (spark
                       .read
                       .options(**options)
                       .schema(ddl_schema)
                       .format(fileFormat)
                       .load(inPath)
                     )

CodePudding user response:

You are using the wrong syntax for STRUCT.
Here is the right one:

name STRUCT<forename:STRING,surname:STRING>

https://spark.apache.org/docs/latest/sql-ref-datatypes.html
(search for Complex types and choose the SQL tab)

Data type SQL name
BooleanType BOOLEAN
ByteType BYTE, TINYINT
ShortType SHORT, SMALLINT
IntegerType INT, INTEGER
LongType LONG, BIGINT
FloatType FLOAT, REAL
DoubleType DOUBLE
DateType DATE
TimestampType TIMESTAMP
StringType STRING
BinaryType BINARY
DecimalType DECIMAL, DEC, NUMERIC
YearMonthIntervalType INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
DayTimeIntervalType INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
ArrayType ARRAY<element_type>
StructType STRUCT<field1_name: field1_type, field2_name: field2_type, …> Note: ‘:’ is optional.
MapType MAP<key_type, value_type>
  • Related