Question
I am trying to define a nested .json schema in pyspark, but cannot get the ddl_schema string to work.
Usually in SQL this would be ROW, I have tried STRUCT below but can't get the data type correct this is the error...
ParseException:
mismatched input '(' expecting {<EOF>, ',', 'COMMENT', NOT}(line 6, pos 15)
== SQL ==
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
---------------^^^
dob DATE,
nationality STRING,
url STRING
Data Sample
-------- ---------- ------ ---- -------------------- ---------- ----------- --------------------
|driverId| driverRef|number|code| name| dob|nationality| url|
-------- ---------- ------ ---- -------------------- ---------- ----------- --------------------
| 1| hamilton| 44| HAM| {Lewis, Hamilton}|1985-01-07| British|http://en.wikiped...|
Code Sample
mnt = "/mnt/dev/root"
env = "raw"
path = "formula1/drivers"
fileFormat = "json"
inPath = f"{mnt}/{env.upper()}/{path}.{fileFormat}"
options = {'header': 'True'}
ddl_schema = """
driverId INT,
driverRef STRING,
number STRING,
code STRING,
name STRUCT(forename STRING, surname STRING),
dob DATE,
nationality STRING,
url STRING
"""
drivers_df = (spark
.read
.options(**options)
.schema(ddl_schema)
.format(fileFormat)
.load(inPath)
)
CodePudding user response:
You are using the wrong syntax for STRUCT.
Here is the right one:
name STRUCT<forename:STRING,surname:STRING>
https://spark.apache.org/docs/latest/sql-ref-datatypes.html
(search for Complex types
and choose the SQL tab)
Data type | SQL name |
---|---|
BooleanType | BOOLEAN |
ByteType | BYTE, TINYINT |
ShortType | SHORT, SMALLINT |
IntegerType | INT, INTEGER |
LongType | LONG, BIGINT |
FloatType | FLOAT, REAL |
DoubleType | DOUBLE |
DateType | DATE |
TimestampType | TIMESTAMP |
StringType | STRING |
BinaryType | BINARY |
DecimalType | DECIMAL, DEC, NUMERIC |
YearMonthIntervalType | INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH |
DayTimeIntervalType | INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND |
ArrayType | ARRAY<element_type> |
StructType | STRUCT<field1_name: field1_type, field2_name: field2_type, …> Note: ‘:’ is optional. |
MapType | MAP<key_type, value_type> |