Define Spark schema for given JSON

Time:10-24

I load this JSON into a Spark DataFrame without specifying a schema:

{
 "titles": {
  "L": [
   {
    "S": "ABC"
   }
  ]
 }
}

The result of df.printSchema() is

root
 |-- titles: struct (nullable = true)
 |    |-- L: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)

I tried to translate this schema into code as shown below, but it does not work:

AS = StructType([StructField
  ("L",
    ArrayType(StructField("S", StringType(), True))
 )   
]) 

my_schema = StructType([
   StructField("titles", AS ,True)
])

When I use my_schema to read the same JSON, I get this error:

"Failed to convert the JSON string '{"metadata":{},"name":"S","nullable":true,"type":"string"}' to a data type".

How can I fix it?

CodePudding user response:

The schema you defined is missing a level.

Each element of the array L must be a StructType that contains the StructField S. Your code passes the StructField directly to ArrayType, so this StructType wrapper is missing.

The correct schema is:

my_schema = StructType([
    StructField("titles", StructType([
        StructField("L", ArrayType(
            StructType([
                StructField("S", StringType(), True)
            ])
        ))
    ]), True)
])

CodePudding user response:

After creating the DataFrame from the JSON file, print the inferred schema with print(df.schema):

df = spark.read.option("multiline","true").json("/content/sample_data/test.json")

print(df.schema)

[Out]:
StructType([StructField('titles', StructType([StructField('L', ArrayType(StructType([StructField('S', StringType(), True)]), True), True)]), True)])

The printed schema can be used "as is" to define the schema:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType, Row

schema_2 = StructType([StructField('titles', StructType([StructField('L', ArrayType(StructType([StructField('S', StringType(), True)]), True), True)]), True)])

data = [Row(Row([Row("ABC")]))]

spark.createDataFrame(data=data, schema=schema_2).printSchema()

[Out]:

root
 |-- titles: struct (nullable = true)
 |    |-- L: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)