So I have data streaming in like the following:
{
  "messageDetails": {
    "id": "1",
    "name": "2"
  },
  "messageMain": {
    "date": "string",
    "details": [{"val1":"abcd","val2":"efgh"},{"val1":"aaaa","val2":"bbbb"}]
  }
}
Here is an example message. Normally, I would define a schema like the following:
val tableSchema: StructType = (new StructType)
  .add("messageDetails", (new StructType)
    .add("id", StringType)
    .add("name", StringType))
  .add("messageMain", (new StructType)
    .add("date", StringType)
    .add("details", ???))
Then read in the messages like so -
val df = spark.read.schema(tableSchema).json(rdd)
However, I am not sure how to define details, since it is a list of objects rather than a single StructType. I would rather not simply explode the rows if there is another way, because the end goal is to write this back to a Google BigQuery table that defines details as a repeated record type.
CodePudding user response:
You want an ArrayType of StructType holding val1 and val2 StringTypes, e.g.
import org.apache.spark.sql.types._

// Schema for one element of the details array
val itemSchema = (new StructType)
  .add("val1", StringType)
  .add("val2", StringType)

// ArrayType(elementType, containsNull): false means the array's elements may not be null
val detailsSchema = new ArrayType(itemSchema, false)

val tableSchema: StructType = (new StructType)
  .add("messageDetails", (new StructType)
    .add("id", StringType)
    .add("name", StringType))
  .add("messageMain", (new StructType)
    .add("date", StringType)
    .add("details", detailsSchema))
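As a quick sanity check, a minimal sketch (assuming an active SparkSession named spark; the sample string below is just the message from the question collapsed onto one line) of reading the sample with this schema:

```scala
import spark.implicits._

// The example message from the question, as a one-line JSON string
val sample = Seq(
  """{"messageDetails":{"id":"1","name":"2"},"messageMain":{"date":"string","details":[{"val1":"abcd","val2":"efgh"},{"val1":"aaaa","val2":"bbbb"}]}}"""
)

val df = spark.read.schema(tableSchema).json(sample.toDS)

// printSchema should show messageMain.details as array<struct<val1:string,val2:string>>,
// which is what maps to a REPEATED RECORD column on the BigQuery side,
// so no explode is needed before writing.
df.printSchema()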