How to create a schema for an array of structs in Spark?


I am trying to create the schema for the following data:

[{"a": "6"},{"b":"7"}]

My code so far:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import spark.implicits._

val data = Seq("""[{"a": "6"},{"b":"7"}]""").toDF()
val schema = StructType(Seq(
               StructField("colA", StructType(
                 Seq(StructField("a", StringType, true))
               ), true),
               StructField("colB", StructType(
                 Seq(StructField("b", StringType, true))
               ), true)
             ))
val res = data.withColumn("res", from_json(col("value"), schema))

When I do res.show(), it returns null. Any ideas anyone?

CodePudding user response:

You get null because the provided schema doesn't fit the data.

For this data you can nicely define the schema as an array of maps:

val res = data.withColumn("res", from_json($"value", ArrayType(MapType(StringType, StringType))))
res.show()
// +--------------------+--------------------+
// |               value|                 res|
// +--------------------+--------------------+
// |[{"a": "6"},{"b":...|[{a -> 6}, {b -> 7}]|
// +--------------------+--------------------+
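If you go the map route, you can pull the values back out with element_at (available since Spark 2.4). A minimal sketch, assuming the res column from above:

import org.apache.spark.sql.functions.element_at

// element_at is 1-based on arrays and key-based on maps,
// so this grabs "a" from the first map and "b" from the second.
res.select(
  element_at(element_at($"res", 1), "a").as("a"),
  element_at(element_at($"res", 2), "b").as("b")
).show()
// +---+---+
// |  a|  b|
// +---+---+
// |  6|  7|
// +---+---+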

I don't think you can build an array of structs out of this data. It would work if the array held a single dictionary, or several dictionaries sharing the same keys. But here you have two dictionaries with different keys, which would require a different struct type for each element, and an array can only hold one element type.

The closest I can think of is an array of structs where each struct has both fields (a and b); one of the two fields will then be null in every element:

val res = data.withColumn("res", from_json($"value", ArrayType(StructType(Seq(
  StructField("a", StringType, true),
  StructField("b", StringType, true)
)))))
res.show(truncate=false)
// +----------------------+----------------------+
// |value                 |res                   |
// +----------------------+----------------------+
// |[{"a": "6"},{"b":"7"}]|[{6, null}, {null, 7}]|
// +----------------------+----------------------+
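If you then want flat columns, you could collapse the two half-empty structs with coalesce. A sketch, assuming the res column from the snippet above:

import org.apache.spark.sql.functions.coalesce

// Take whichever array element actually carries each field.
res.select(
  coalesce($"res"(0)("a"), $"res"(1)("a")).as("a"),
  coalesce($"res"(0)("b"), $"res"(1)("b")).as("b")
).show()
// +---+---+
// |  a|  b|
// +---+---+
// |  6|  7|
// +---+---+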

CodePudding user response:

By default the JSON data source infers the schema from the input file, so if you have the JSON stored in a file, you don't need to define the schema yourself:

val df = spark.read.option("multiLine", "true").json("<PATH_to_JSON_File>")

You only need the multiLine option if a single JSON record spans multiple lines. Example:

//read json file into dataframe
val df = spark.read.json("src/main/resources/myJsonFile.json")
df.printSchema()
df.show(false)

You can still define the schema yourself, but note that your array is not consistent: its two elements have different keys, which makes them elements of two different types, and you can't mix element types inside a single array in your schema.

So I think your JSON should look like this: [{"a":"6","b":"7"}], in which case your schema should be:

val schema = ArrayType(StructType(Array(
               StructField("a", StringType, true),
               StructField("b", StringType, true)
             )), false)

The false flag means the array itself cannot contain null elements, but the struct fields inside the elements can still be null.
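For completeness, a quick sketch checking that schema against the reshaped JSON (assuming spark.implicits._ and from_json are in scope; the exact rendering of the struct in show() may vary by Spark version):

// Parse the reshaped single-struct array with the explicit schema.
val fixed = Seq("""[{"a":"6","b":"7"}]""").toDF("value")
fixed.withColumn("res", from_json($"value", schema)).show(false)
// +-------------------+--------+
// |value              |res     |
// +-------------------+--------+
// |[{"a":"6","b":"7"}]|[{6, 7}]|
// +-------------------+--------+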
