I load this JSON into a Spark DataFrame without specifying a schema:
{
  "titles": {
    "L": [
      {
        "S": "ABC"
      }
    ]
  }
}
The result of df.printSchema() is:
root
 |-- titles: struct (nullable = true)
 |    |-- L: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)
I tried to translate this schema into code, but my attempt below fails:
AS = StructType([
    StructField("L", ArrayType(StructField("S", StringType(), True)))
])
my_schema = StructType([
    StructField("titles", AS, True)
])
When I used my_schema to read the same JSON, I got this error:
"Failed to convert the JSON string '{"metadata":{},"name":"S","nullable":true,"type":"string"}' to a data type".
How can I fix it?
CodePudding user response:
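The schema you defined is missing a level. ArrayType takes the data type of its elements, and printSchema() shows that each element of the array L is a struct. The element type must therefore be a StructType that contains the StructField S. Your code passes the StructField S directly and leaves out this StructType.
The correct schema is: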
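from pyspark.sql.types import StructType, StructField, ArrayType, StringType

my_schema = StructType([
    StructField("titles", StructType([
        StructField("L", ArrayType(
            # The array element is a struct, so wrap the field S in a StructType.
            StructType([
                StructField("S", StringType(), True)
            ])
        ))
    ]), True)
])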
CodePudding user response:
After creating the DataFrame from the JSON file, print its schema with print(df.schema):
df = spark.read.option("multiline","true").json("/content/sample_data/test.json")
print(df.schema)
[Out]:
StructType([StructField('titles', StructType([StructField('L', ArrayType(StructType([StructField('S', StringType(), True)]), True), True)]), True)])
The printed schema can be used "as is" to define the schema:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, Row
schema_2 = StructType([StructField('titles', StructType([StructField('L', ArrayType(StructType([StructField('S', StringType(), True)]), True), True)]), True)])
data = [Row(Row([Row("ABC")]))]
spark.createDataFrame(data=data, schema=schema_2).printSchema()
[Out]:
root
 |-- titles: struct (nullable = true)
 |    |-- L: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)