I have a json as below stored in amazon s3 in json files.
{
"owner": "mindey",
"ownerId": "4e751a",
"batch": "test1",
"mode": "testserve",
"subType": "self",
"versionId": "v89.0563",
"createdDate": "2021-12-10T00:41:02.652Z",
"modifiedDate": "2021-12-10T00:41:02.652Z",
"createdBy": "[email protected]",
"modifiedBy": "[email protected]",
"numChanges": 10,
"progressFlag": true,
"auditing": {
"hack": "TBD",
"facetCd": [
"652439",
"A97341"
],
"modelColors": [
"083y32-058",
"978t22-456",
"A097ee-009"
],
"consumersGroup": [
"25a9874-098a-08nh-4424-90838e"
],
"cnsmrTargets": [
{
"id": "7jsfg8734-038h-9825-9784-37768ebe6ba1",
"type": "self"
}
],
"modelGroups": [
{
"seedIds": [
"7kjshf99-754a-gf78-a015",
"6d874mq3-23fa-4e76-baad"
],
"exact": true,
"position": 5
}
],
"txGroups": [
{
"txIds": [
"7bw97862-aca7-9873",
"5ddfb803-2f11-8763"
]
}
],
"shops": [
"ue749nt2-acc6-2452",
"ue749nt2-982y-4286"
]
}
}
Now, I read the files and wanted defined a schema in pyspark to append the data to a table. So I defined my schema as below and read it through a temporary dataframe in pyspark as below. The code didn't error out, however I was looking to get a different result which is also mentioned below
sch = StructType([StructField("owner",StringType()),
StructField("ownerId",StringType()),
StructField("batch",StringType()),
StructField("mode",StringType()),
StructField("subType",StringType()),
StructField("versionId",StringType()),
StructField("createdDate",StringType()),
StructField("modifiedDate",StringType()),
StructField("createdBy",StringType()),
StructField("modifiedBy",StringType()),
StructField("numChanges",StringType()),
StructField("progressFlag",StringType()),
StructField("auditing",StringType())])
df_read = spark.read.json(src,schema=sch)
df_read.createOrReplaceTempView('df')
spark.sql("select * from df").show(10,False)
------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|owner |ownerId|batch|mode|subType|versionId|createdDate |modifiedDate |createdBy |modifiedBy |numChanges|progressFlag|auditing |
------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|mindey|4e751a |test1|null|self |v89.0563 |2021-12-10T00:41:02.652Z|2021-12-10T00:41:02.652Z|[email protected]|[email protected]|10 |true |{"hack":"TBD","facetCd":["652439","A97341"],"modelColors":["083y32-058","978t22-456","A097ee-009"],"consumersGroup":["25a9874-098a-08nh-4424-90838e"],"cnsmrTargets":[{"id":"7jsfg8734-038h-9825-9784-37768ebe6ba1","type":"self"}],"modelGroups":[{"seedIds":["7kjshf99-754a-gf78-a015","6d874mq3-23fa-4e76-baad"],"exact":true,"position":5}],"txGroups":[{"txIds":["7bw97862-aca7-9873","5ddfb803-2f11-8763"]}],"shops":["ue749nt2-acc6-2452","ue749nt2-982y-4286"]}|
------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I expect the output to be as below where the nested json is split to different columns appropriately -
------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ------------- -------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------------------- -------------------------------------------------------- -------------------------------------------------
|owner |ownerId|batch|mode|subType|versionId|createdDate |modifiedDate |createdBy |modifiedBy |numChanges|progressFlag|auditing_hack|auditing_facetCd |auditing_modelColors |auditing_consumersGroup |auditing_cnsmrTargets_id |auditing_cnsmrTargets_type |auditing_modelGroups_seedIds |auditing_txGroups_txIds |auditing_txGroups_shops |
------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ------------- -------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------------------- -------------------------------------------------------- -------------------------------------------------
|mindey|4e751a |test1|null|self |v89.0563 |2021-12-10T00:41:02.652Z|2021-12-10T00:41:02.652Z|[email protected]|[email protected]|10 |true |TBD |["652439","A97341"] |["083y32-058","978t22-456","A097ee-009"] |25a9874-098a-08nh-4424-90838e |7jsfg8734-038h-9825-9784-37768ebe6ba1 |self |["7kjshf99-754a-gf78-a015","6d874mq3-23fa-4e76-baad"] |["7bw97862-aca7-9873","5ddfb803-2f11-8763"] |["ue749nt2-acc6-2452","ue749nt2-982y-4286"]} |
------ ------- ----- ---- ------- --------- ------------------------ ------------------------ ------------ ------------ ---------- ------------ ------------- -------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------- -------------------------------------------------------- -------------------------------------------------------- -------------------------------------------------
Can someone help me with this result. Thanks in advance!!
CodePudding user response:
The schema object should look like this:
sch = StructType([
StructField("owner",StringType()),
StructField("ownerId",StringType()),
StructField("batch",StringType()),
StructField("mode",StringType()),
StructField("subType",StringType()),
StructField("versionId",StringType()),
StructField("createdDate",StringType()),
StructField("modifiedDate",StringType()),
StructField("createdBy",StringType()),
StructField("modifiedBy",StringType()),
StructField("numChanges",StringType()),
StructField("progressFlag",StringType()),
StructField("auditing",StructType([
StructField('hack', StringType()),
StructField('facetCd', ArrayType(StringType())),
StructField('modelColors', ArrayType(StringType())),
StructField('consumersGroup', ArrayType(StringType())),
StructField('cnsmrTargets', ArrayType(StringType())),
StructField('modelGroups', ArrayType(StructType([
StructField('seedIds', ArrayType(StringType())),
StructField('exact', StringType()),
StructField('position', IntegerType()),
]))),
StructField('facetCd', ArrayType(StructType([
StructField('txIds', ArrayType(StringType())),
]))),
StructField('shops', ArrayType(StringType())),
]))
]
)
Once you apply the schema, you'll get a nested struct for auditing column. Then a combination of select statement and alias can get you the desired column names.