How to parse a JSON schema with a list of objects in Spark Streaming?

So I have data streaming in like the following:

{
  "messageDetails": {
    "id": "1",
    "name": "2"
  },
  "messageMain": {
    "date": "string",
    "details": [{"val1":"abcd","val2":"efgh"},{"val1":"aaaa","val2":"bbbb"}]
  }
}

Here is an example message. Normally, I would define a schema like the following:

    val tableSchema: StructType = (new StructType)
      .add("messageDetails", (new StructType)
        .add("id", StringType)
        .add("name", StringType))
      .add("messageMain", (new StructType)
        .add("date", StringType)
        .add("details", ???)) // what type goes here?

Then read in the messages like so -

val df = spark.read.schema(tableSchema).json(rdd)

However, I am not sure how to define details, since it is a list of objects and not a StructType. I would rather not simply explode the rows if there is another way, because the end goal is to write back to a Google BigQuery table that defines details as a repeated record type.

Answer:

You want an ArrayType whose element type is a StructType with two StringType fields, val1 and val2.

For example:

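import org.apache.spark.sql.types._

// Each element of "details" is a struct with two string fields.
val itemSchema = (new StructType)
  .add("val1", StringType)
  .add("val2", StringType)

// "details" is a JSON array of those structs, so wrap the struct in an ArrayType.
// containsNull = false declares that the array holds no null elements.
val detailsSchema = ArrayType(itemSchema, containsNull = false)

val tableSchema: StructType = (new StructType)
  .add("messageDetails", (new StructType)
    .add("id", StringType)
    .add("name", StringType))
  .add("messageMain", (new StructType)
    .add("date", StringType)
    .add("details", detailsSchema))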
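
To check that this keeps one row per message without exploding, here is a minimal sketch that applies the schema to the sample message from the question. The object name `DetailsSchemaSketch`, the `sample` value, and the local `SparkSession` are assumptions made for illustration and are not part of the original job. The second half uses `from_json` on a string column, on the assumption that the streaming source delivers each message as a JSON string in a column named `value`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical wrapper object, used only for this illustration.
object DetailsSchemaSketch {
  // Element type of "details": a struct with two string fields.
  val itemSchema: StructType = (new StructType)
    .add("val1", StringType)
    .add("val2", StringType)

  // Full message schema, with "details" declared as an array of structs.
  val tableSchema: StructType = (new StructType)
    .add("messageDetails", (new StructType)
      .add("id", StringType)
      .add("name", StringType))
    .add("messageMain", (new StructType)
      .add("date", StringType)
      .add("details", ArrayType(itemSchema, containsNull = false)))

  // The sample message from the question, as a single-line JSON string.
  val sample: String =
    """{"messageDetails":{"id":"1","name":"2"},"messageMain":{"date":"string","details":[{"val1":"abcd","val2":"efgh"},{"val1":"aaaa","val2":"bbbb"}]}}"""

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption); a real job would reuse its own.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("details-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // Batch-style read as in the question, but from a Dataset[String] instead of an RDD.
    val df = spark.read.schema(tableSchema).json(Seq(sample).toDS())
    df.printSchema()

    // Selecting a field through the array yields an array of that field's values,
    // still within a single row for the single input message.
    df.select(col("messageMain.details.val1")).show(truncate = false)

    // The same schema applied to a string column with from_json.
    val parsed = Seq(sample).toDF("value")
      .select(from_json(col("value"), tableSchema).as("msg"))
      .select("msg.*")
    parsed.printSchema()

    spark.stop()
  }
}
```

Because details stays an array column instead of being exploded, each message remains a single row with a nested list, which is the shape the question wants to pass on to BigQuery.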