This is a follow-up question from post. @abiratis, thanks for your answer. We are trying to implement the same in our Glue jobs; the only change is that we don't have a static schema defined, so we have created a new column, colSchema, to hold the schema of each entry of the some-array attribute. It looks like this:
+------------------------+-----------------------------------------------------------------------------------------------------------------------+
|some-array              |colSchema                                                                                                              |
+------------------------+-----------------------------------------------------------------------------------------------------------------------+
|[{f1a, f2a}, {f1b, f2b}]|ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)|
+------------------------+-----------------------------------------------------------------------------------------------------------------------+
But when I try to parse it with from_json, I get an error. The conversion is done like this:
final_df.select(from_json(col('some-array'), 'ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)', {'allowUnquotedFieldNames': True}).alias('json1')).show(3, False)
The error is:
AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'ArrayType': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (String)"ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)"; line: 1, column: 10]
Failed fallback parsing: Cannot parse the data type:
mismatched input 'StructType' expecting INTEGER_VALUE(line 1, pos 10)
== SQL ==
ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)
----------^^^
Any help would be highly appreciated.
CodePudding user response:
From the from_json documentation:
schema: DataType or str a StructType or ArrayType of StructType to use when parsing the json column.
Changed in version 2.3: the DDL-formatted string is also supported for schema.
The first parameter should be a JSON-like column, which you have correct. The second parameter is either a DataType or a str formatted as a DDL string. This is where it goes wrong: you are passing the printed representation of a DataType as a plain string, which is neither a DataType object nor a valid DDL string, so it cannot be parsed.
For your example, I think the correct DDL definition would be something like the following:
'ARRAY<STRUCT<`array-field-1`: STRING, `array-field-2`: STRING>>'
Therefore,
final_df.select(from_json(col('some-array'), 'ARRAY<STRUCT<`array-field-1`: STRING, `array-field-2`: STRING>>'))
should work.
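Since the schema in the question is not static, the DDL string may have to be assembled at runtime. The following is only a sketch: build_array_struct_ddl is a hypothetical helper, not part of PySpark, and it assumes that every field is a STRING and that field names contain no backticks.

```python
def build_array_struct_ddl(field_names):
    """Return a DDL schema string for an array of structs with string fields.

    Hypothetical helper for illustration; not a PySpark function.
    """
    # Backtick-quote each name so that characters such as '-' are accepted in DDL.
    fields = ", ".join("`{}`: STRING".format(name) for name in field_names)
    return "ARRAY<STRUCT<{}>>".format(fields)


schema_ddl = build_array_struct_ddl(["array-field-1", "array-field-2"])
print(schema_ddl)
```

The resulting string can then be passed as the second argument, for example from_json(col('some-array'), schema_ddl). As far as I know, from_json applies one schema per call, so this covers a schema computed in driver code, not one that differs from row to row.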