Home > front end >  Defining Schema for json data in Pyspark
Defining Schema for json data in Pyspark

Time:12-23

I have a json as below stored in amazon s3 in json files.

{
    "owner": "mindey",
    "ownerId": "4e751a",
    "batch": "test1",
    "mode": "testserve",
    "subType": "self",
    "versionId": "v89.0563",
    "createdDate": "2021-12-10T00:41:02.652Z",
    "modifiedDate": "2021-12-10T00:41:02.652Z",
    "createdBy": "[email protected]",
    "modifiedBy": "[email protected]",
    "numChanges": 10,
    "progressFlag": true,
    "auditing": {
        "hack": "TBD",
        "facetCd": [
            "652439",
            "A97341"
        ],
        "modelColors": [
            "083y32-058",
            "978t22-456",
            "A097ee-009"
        ],
        "consumersGroup": [
            "25a9874-098a-08nh-4424-90838e"
        ],
        "cnsmrTargets": [
            {
                "id": "7jsfg8734-038h-9825-9784-37768ebe6ba1",
                "type": "self"
            }
        ],
        "modelGroups": [
            {
                "seedIds": [
                    "7kjshf99-754a-gf78-a015",
                    "6d874mq3-23fa-4e76-baad"
                ],
                "exact": true,
                "position": 5
            }
        ],
        "txGroups": [
            {
                "txIds": [
                    "7bw97862-aca7-9873",
                    "5ddfb803-2f11-8763"
                ]
            }
        ],
        "shops": [
            "ue749nt2-acc6-2452",
            "ue749nt2-982y-4286"
        ]
    }
}

Now, I read the files and wanted defined a schema in pyspark to append the data to a table. So I defined my schema as below and read it through a temporary dataframe in pyspark as below. The code didn't error out, however I was looking to get a different result which is also mentioned below

sch = StructType([StructField("owner",StringType()), 
        StructField("ownerId",StringType()), 
        StructField("batch",StringType()), 
        StructField("mode",StringType()), 
        StructField("subType",StringType()), 
        StructField("versionId",StringType()), 
        StructField("createdDate",StringType()), 
        StructField("modifiedDate",StringType()), 
        StructField("createdBy",StringType()), 
        StructField("modifiedBy",StringType()), 
        StructField("numChanges",StringType()), 
        StructField("progressFlag",StringType()), 
        StructField("auditing",StringType())])


df_read = spark.read.json(src,schema=sch)
df_read.createOrReplaceTempView('df')
spark.sql("select * from df").show(10,False)

 ------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|owner |ownerId|batch|mode|subType|versionId|createdDate             |modifiedDate            |createdBy   |modifiedBy  |numChanges|progressFlag|auditing                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
 ------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|mindey|4e751a |test1|null|self   |v89.0563 |2021-12-10T00:41:02.652Z|2021-12-10T00:41:02.652Z|[email protected]|[email protected]|10        |true        |{"hack":"TBD","facetCd":["652439","A97341"],"modelColors":["083y32-058","978t22-456","A097ee-009"],"consumersGroup":["25a9874-098a-08nh-4424-90838e"],"cnsmrTargets":[{"id":"7jsfg8734-038h-9825-9784-37768ebe6ba1","type":"self"}],"modelGroups":[{"seedIds":["7kjshf99-754a-gf78-a015","6d874mq3-23fa-4e76-baad"],"exact":true,"position":5}],"txGroups":[{"txIds":["7bw97862-aca7-9873","5ddfb803-2f11-8763"]}],"shops":["ue749nt2-acc6-2452","ue749nt2-982y-4286"]}|
 ------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 

I expect the output to be as below where the nested json is split to different columns appropriately -



 ------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ------------- -------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------------------- -------------------------------------------------------- ------------------------------------------------- 
|owner |ownerId|batch|mode|subType|versionId|createdDate             |modifiedDate            |createdBy   |modifiedBy  |numChanges|progressFlag|auditing_hack|auditing_facetCd    |auditing_modelColors                        |auditing_consumersGroup                     |auditing_cnsmrTargets_id                    |auditing_cnsmrTargets_type                  |auditing_modelGroups_seedIds                            |auditing_txGroups_txIds                                 |auditing_txGroups_shops                          |
 ------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ------------- -------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------------------- -------------------------------------------------------- ------------------------------------------------- 
|mindey|4e751a |test1|null|self   |v89.0563 |2021-12-10T00:41:02.652Z|2021-12-10T00:41:02.652Z|[email protected]|[email protected]|10        |true        |TBD          |["652439","A97341"] |["083y32-058","978t22-456","A097ee-009"]    |25a9874-098a-08nh-4424-90838e               |7jsfg8734-038h-9825-9784-37768ebe6ba1       |self                                        |["7kjshf99-754a-gf78-a015","6d874mq3-23fa-4e76-baad"]   |["7bw97862-aca7-9873","5ddfb803-2f11-8763"]             |["ue749nt2-acc6-2452","ue749nt2-982y-4286"]}     |
 ------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ------------- -------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------------------- -------------------------------------------------------- ------------------------------------------------- 

Can someone help me with this result. Thanks in advance!!

CodePudding user response:

The schema object should look like this:

sch = StructType([
  StructField("owner",StringType()), 
  StructField("ownerId",StringType()), 
  StructField("batch",StringType()), 
  StructField("mode",StringType()), 
  StructField("subType",StringType()), 
  StructField("versionId",StringType()), 
  StructField("createdDate",StringType()), 
  StructField("modifiedDate",StringType()), 
  StructField("createdBy",StringType()), 
  StructField("modifiedBy",StringType()), 
  StructField("numChanges",StringType()), 
  StructField("progressFlag",StringType()), 
  StructField("auditing",StructType([
    StructField('hack', StringType()),
    StructField('facetCd', ArrayType(StringType())),
    StructField('modelColors', ArrayType(StringType())),
    StructField('consumersGroup', ArrayType(StringType())),
    StructField('cnsmrTargets', ArrayType(StringType())),
    StructField('modelGroups', ArrayType(StructType([
      StructField('seedIds', ArrayType(StringType())),
      StructField('exact', StringType()),
      StructField('position', IntegerType()),
    ]))),
    StructField('facetCd', ArrayType(StructType([
      StructField('txIds', ArrayType(StringType())),
    ]))),
    StructField('shops', ArrayType(StringType())),
  ]))
 ]
)

Once you apply the schema, you'll get a nested struct for auditing column. Then a combination of select statement and alias can get you the desired column names.

  • Related