pyspark from_json is failing with error: Cannot parse the schema in JSON format: Unrecognized token-CodePudding

parquet_path = /tmp/test-parquet

content of t2.json is:

{
   "id": "OK_good2", 
   "some-array": [
      {"array-field-1":"f1a","array-field-2":"f2a"},
      {"array-field-1":"f1b","array-field-2":"f2b"}
   ]
}

creating dataframe from t2.json

df = spark.read.json('t2.json')
df = df.withColumn('some-array', col('some-array').cast('string'))
df.write.mode("overwrite").parquet(parquet_path)

formed schema from:

schema = dict(df.dtypes)['some-array'] # o/p array<struct<array-field-1:string,array-field-2:string>>

Reading from parquet_path:

final_df = spark.read.parquet(parquet_path)
final_df.select('some-array').show(3, False)

 ------------------------ 
|some-array              |
 ------------------------ 
|[{f1a, f2a}, {f1b, f2b}]|
 ------------------------

While trying to get the JSON schema of the same using from_json its failing. Im not able to figure out why. Please provide some help.

final_df.select(from_json(col('some-array'), 'array<struct<array-field-1:string,array-field-2:string>>', {'allowUnquotedFieldNames':True}).
                alias('json1')).show(2, False)

throws error:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"array<struct<array-field-1:string,array-field-2:string>>"; line: 1, column: 6]
Failed fallback parsing: Cannot parse the data type:

In case anyone interested i'm trying to follow this post

CodePudding user response：

A few issues in your code.

You are losing key data from the JSON in this line.

df = df.withColumn('some-array', col('some-array').cast('string'))

So, when you are reading the parquet file that is saved, you only see this.

 ------------------------ 
|some-array              |
 ------------------------ 
|[{f1a, f2a}, {f1b, f2b}]|
 ------------------------

and NOT. (This is what you expect to have.)

 ------------------------------------------------------- 
|some-array                                             |
 ------------------------------------------------------- 
|[{"array_field_1": "f1a", "array_field_2": "f2a"}, ...]|
 -------------------------------------------------------

The proper way to cast the array structure to JSON string is to use to_json.

df = df.withColumn('some-array', to_json('some-array'))

Please check df.show() or df.take(1) to see the difference.

Your DDL string needs wrapping with (`) for column name. (ref: https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/)

'array<struct<`array-field-1`:string,`array-field-2`:string>>'