PySpark: Cannot parse the schema in JSON format: Unrecognized token 'ArrayType'

This is a follow-up to an earlier post. @abiratis, thanks for your answer. We are trying to implement the same approach in our Glue jobs. The only difference is that we don't have a static schema defined, so we created a new column, colSchema, that holds the schema of each entry of the some-array attribute. It looks like this:

+------------------------+-----------------------------------------------------------------------------------------------------------------------+
|some-array              |colSchema                                                                                                              |
+------------------------+-----------------------------------------------------------------------------------------------------------------------+
|[{f1a, f2a}, {f1b, f2b}]|ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)|
+------------------------+-----------------------------------------------------------------------------------------------------------------------+

But when I parse the column with from_json, I get an error. The conversion is done like this:

final_df.select(from_json(col('some-array'), 'ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)', {'allowUnquotedFieldNames': True}).alias('json1')).show(3, False)

The error is:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'ArrayType': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)"; line: 1, column: 10]
Failed fallback parsing: Cannot parse the data type: 
mismatched input 'StructType' expecting INTEGER_VALUE(line 1, pos 10)

== SQL ==
ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)
----------^^^

Any help would be highly appreciated.

CodePudding user response：

From the from_json documentation:

schema (DataType or str): a StructType or ArrayType of StructType to use when parsing the json column.

Changed in version 2.3: the DDL-formatted string is also supported for schema.

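The first parameter should be a JSON-like column, which you have correct. The second parameter is either a DataType object or a str formatted as a DDL string. This is where it goes wrong: you are passing the printed representation of a DataType as a plain string. That text is neither valid JSON nor valid DDL, which is why both parsing attempts in the error message fail.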
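The error message also hints that Spark first tries to read a schema string as JSON before falling back to DDL. For reference, here is a minimal sketch of what the JSON form looks like for this schema, built with only the standard library. The helper name is hypothetical, and the layout is assumed to match what DataType.json() emits in PySpark, so check it against your Spark version before relying on it:

```python
import json


def array_of_struct_schema_json(field_names):
    """Hypothetical helper: JSON schema string for an array of structs.

    Assumes every field is a nullable string, as in the question.
    """
    fields = [
        {"name": name, "type": "string", "nullable": True, "metadata": {}}
        for name in field_names
    ]
    schema = {
        "type": "array",
        "elementType": {"type": "struct", "fields": fields},
        "containsNull": True,
    }
    return json.dumps(schema)


schema_json = array_of_struct_schema_json(["array-field-1", "array-field-2"])
print(schema_json)
```

One advantage of the JSON form is that field names containing hyphens need no special quoting.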

For your example, I think the correct definition would be something like the following:

'ARRAY<STRUCT<`array-field-1`: STRING, `array-field-2`: STRING>>'

Therefore,

final_df.select(from_json(col('some-array'), 'ARRAY<STRUCT<`array-field-1`: STRING, `array-field-2`: STRING>>'))

should work. Note the backticks around the field names: they are needed because the names contain hyphens.
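Since the schema in the question is not static but stored as text in colSchema, one option is to translate that text into a DDL string before calling from_json. The following is only a rough sketch under stated assumptions: repr_to_ddl and the type mapping are hypothetical, and it handles just the shape shown in the question (an array of structs with primitive fields), not nested or parameterized types.

```python
import re

# Assumed mapping from the type names printed in the schema text to DDL
# keywords; extend it for whatever types your data actually contains.
_DDL_TYPES = {
    "StringType": "STRING",
    "IntegerType": "INT",
    "LongType": "BIGINT",
    "DoubleType": "DOUBLE",
    "BooleanType": "BOOLEAN",
}

# Matches one flat field such as StructField(array-field-1,StringType,true)
_FIELD_RE = re.compile(r"StructField\(([^,()]+),(\w+),(?:true|false)\)")


def repr_to_ddl(schema_repr: str) -> str:
    """Hypothetical helper: convert the printed schema text to a DDL string."""
    if not schema_repr.startswith("ArrayType(StructType("):
        raise ValueError(f"unsupported schema text: {schema_repr!r}")
    fields = _FIELD_RE.findall(schema_repr)
    if not fields:
        raise ValueError("no StructField entries found")
    parts = []
    for name, type_name in fields:
        if type_name not in _DDL_TYPES:
            raise ValueError(f"unmapped type: {type_name}")
        # Backtick-quote the name so hyphens survive DDL parsing;
        # a literal backtick inside a name is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        parts.append(f"{quoted}: {_DDL_TYPES[type_name]}")
    return "ARRAY<STRUCT<" + ", ".join(parts) + ">>"


col_schema = (
    "ArrayType(StructType(List("
    "StructField(array-field-1,StringType,true),"
    "StructField(array-field-2,StringType,true))),true)"
)
print(repr_to_ddl(col_schema))
```

The resulting string can then be passed as the second argument to from_json. Keep in mind that from_json takes a single schema for the whole column, so if colSchema differs between rows, each distinct schema would presumably need its own pass over the matching rows.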