Home > Software design >  How to define a schema for json to be used in from_json to parse out values
How to define a schema for json to be used in from_json to parse out values

Time:10-23

I am trying to come up with a schema definition to parse out information from dataframe string column I am using from_json for that . I need help in defining schema which I am somehow not getting it right.

Here is the Json I have

[
 {
  "sectionid":"838096e332d4419191877a3fd40ed1f4",
  "sequence":0,
  "questions":[
     {
        "xid":"urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
        "questionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
        "quizquestionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
        "qtype":"3",
        "sequence":0,
        "subsectionsequence":-1,
        "type":"80",
        "question":"<p>This is a simple, 1 question assessment for automation testing</p>",
        "totalpoints":"5.0",
        "scoring":"1",
        "scoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
        "inputoption":"0",
        "casesensitive":"0",
        "suggestedscoring":"1",
        "suggestedscoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
        "answers":[
           "1"
        ],
        "options":[
           
        ]
     }
  ]
 }
]

I want to parse this information out which will result in columns sectionid , sequence, xid, question.sequence, question.question(question text), answers

Here is what I have I have defined a schema for testing like this

    import org.apache.spark.sql.types.{StringType, ArrayType, StructType, 
     StructField}
 val schema = new StructType()
.add("sectionid", StringType, true)
.add("sequence", StringType, true)
.add("questions", StringType, true)
.add("answers", StringType, true)

  finalDF = finalDF
  .withColumn( "parsed", from_json(col("enriched_payload.transformed"),schema) ) 

But I am getting NULL in result columns the reason I think is my schema is not right. I am struggling to come up with right definition . How do I come up with correct json schema definition ?

I am using spark 3.0

CodePudding user response:

Try below code.

import org.apache.spark.sql.types._

val schema = ArrayType(
    new StructType()
    .add("sectionid",StringType,true)
    .add("sequence",LongType,true)
    .add("questions", ArrayType(      
                        new StructType()
                            .add("answers",ArrayType(StringType,true),true)
                            .add("casesensitive",StringType,true)
                            .add("inputoption",StringType,true)
                            .add("options",ArrayType(StringType,true),true)
                            .add("qtype",StringType,true)
                            .add("question",StringType,true)
                            .add("questionid",StringType,true)
                            .add("quizquestionid",StringType,true)
                            .add("scoring",StringType,true)
                            .add("scoringrules",StringType,true)
                            .add("sequence",LongType,true)
                            .add("subsectionsequence",LongType,true)
                            .add("suggestedscoring",StringType,true)
                            .add("suggestedscoringrules",StringType,true)
                            .add("totalpoints",StringType,true)
                            .add("type",StringType,true)
                            .add("xid",StringType,true)
                        )
    )
) 
  • Related