parquet_path = '/tmp/test-parquet'
The content of t2.json is:
{
"id": "OK_good2",
"some-array": [
{"array-field-1":"f1a","array-field-2":"f2a"},
{"array-field-1":"f1b","array-field-2":"f2b"}
]
}
Creating a dataframe from t2.json:
df = spark.read.json('t2.json')
df = df.withColumn('some-array', col('some-array').cast('string'))
df.write.mode("overwrite").parquet(parquet_path)
I formed the schema from:
schema = dict(df.dtypes)['some-array'] # o/p array<struct<array-field-1:string,array-field-2:string>>
Reading from parquet_path:
final_df = spark.read.parquet(parquet_path)
final_df.select('some-array').show(3, False)
+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+
While trying to parse the same column with from_json,
it's failing. I'm not able to figure out why. Please provide some help.
final_df.select(
    from_json(col('some-array'),
              'array<struct<array-field-1:string,array-field-2:string>>',
              {'allowUnquotedFieldNames': True}).alias('json1')).show(2, False)
This throws an error:
AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (String)"array<struct<array-field-1:string,array-field-2:string>>"; line: 1, column: 6]
Failed fallback parsing: Cannot parse the data type:
In case anyone is interested, I'm trying to follow this post.
CodePudding user response:
A few issues in your code.
- You are losing the JSON key names in this line:
df = df.withColumn('some-array', col('some-array').cast('string'))
So when you read back the saved parquet file, you only see this:
+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+
and NOT this, which is what you'd expect to have:
+-------------------------------------------------------+
|some-array                                             |
+-------------------------------------------------------+
|[{"array-field-1": "f1a", "array-field-2": "f2a"}, ...]|
+-------------------------------------------------------+
The proper way to convert the array of structs to a JSON string is to use to_json:
df = df.withColumn('some-array', to_json('some-array'))
Check df.show() or df.take(1) to see the difference.
- Your DDL string needs the field names wrapped in backticks (`), since they contain hyphens. (ref: https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/)
'array<struct<`array-field-1`:string,`array-field-2`:string>>'