Home > other >  How to convert String to JSON in Spark SQL?
How to convert String to JSON in Spark SQL?

Time:10-04

I was working with the "Delta Logs" of Delta Table and the data of Delta table was stored in the Azure Blob Storage.

I used the below query to fetch the JSON data of Delta Log:

SELECT * FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`

I was able to fetch the JSON data as a table in the Databricks notebook (screenshot below):

enter image description here

Another screenshot for add.stats

enter image description here

Now I traverse through the JSON data in the query itself:

SELECT add.stats FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`

But after the stats the JSON data is in the string format and I am not able to traverse through further.

I am attaching the JSON file here for reference:

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"d21b496a-7282-49c9-a71c-3013d780fbeb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"city\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"age\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"department\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"degree\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"workingLocation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"maxEducation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"experience\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"jobRole\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1664788829119}}
{"add":{"path":"part-00000-91e882d3-f9bc-481f-ba50-a7d061040401-c000.snappy.parquet","partitionValues":{},"size":6337133,"modificationTime":1664788872000,"dataChange":true,"stats":"{\"numRecords\":136251,\"minValues\":{\"Id\":1,\"Name\":\"Robin\",\"city\":\"Jhunjhunu\",\"age\":22,\"department\":\"Data Integration\",\"degree\":\"bsc\",\"workingLocation\":\"Jaipur\",\"maxEducation\":\"Graduation\",\"experience\":2,\"jobRole\":\"Data Engineer\"},\"maxValues\":{\"Id\":136251,\"Name\":\"Robin99999\",\"city\":\"Jhunjhunu99999\",\"age\":136272,\"department\":\"Data Integration99999\",\"degree\":\"bsc99999\",\"workingLocation\":\"Jaipur99999\",\"maxEducation\":\"Graduation99999\",\"experience\":136252,\"jobRole\":\"Data Engineer99999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-25af3704-bc8d-4201-bbe8-6e6b07864e40-c000.snappy.parquet","partitionValues":{},"size":5969233,"modificationTime":1664788870000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":136252,\"Name\":\"Robin136251\",\"city\":\"Jhunjhunu136251\",\"age\":136273,\"department\":\"Data Integration136251\",\"degree\":\"bsc136251\",\"workingLocation\":\"Jaipur136251\",\"maxEducation\":\"Graduation136251\",\"experience\":136253,\"jobRole\":\"Data Engineer136251\"},\"maxValues\":{\"Id\":264149,\"Name\":\"Robin264148\",\"city\":\"Jhunjhunu264148\",\"age\":264170,\"department\":\"Data Integration264148\",\"degree\":\"bsc264148\",\"workingLocation\":\"Jaipur264148\",\"maxEducation\":\"Graduation264148\",\"experience\":264150,\"jobRole\":\"Data Engineer264148\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-0fbaae5e-371e-45ea-b2ce-b959248ba88a-c000.snappy.parquet","partitionValues":{},"size":5961349,"modificationTime":1664788871000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":264150,\"Name\":\"Robin264149\",\"city\":\"Jhunjhunu264149\",\"age\":264171,\"department\":\"Data Integration264149\",\"degree\":\"bsc264149\",\"workingLocation\":\"Jaipur264149\",\"maxEducation\":\"Graduation264149\",\"experience\":264151,\"jobRole\":\"Data Engineer264149\"},\"maxValues\":{\"Id\":392047,\"Name\":\"Robin392046\",\"city\":\"Jhunjhunu392046\",\"age\":392068,\"department\":\"Data Integration392046\",\"degree\":\"bsc392046\",\"workingLocation\":\"Jaipur392046\",\"maxEducation\":\"Graduation392046\",\"experience\":392048,\"jobRole\":\"Data Engineer392046\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00003-1c8ac662-7680-42b8-8ffb-eef96ad37085-c000.snappy.parquet","partitionValues":{},"size":5961263,"modificationTime":1664788862000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":392048,\"Name\":\"Robin392047\",\"city\":\"Jhunjhunu392047\",\"age\":392069,\"department\":\"Data Integration392047\",\"degree\":\"bsc392047\",\"workingLocation\":\"Jaipur392047\",\"maxEducation\":\"Graduation392047\",\"experience\":392049,\"jobRole\":\"Data Engineer392047\"},\"maxValues\":{\"Id\":519945,\"Name\":\"Robin519944\",\"city\":\"Jhunjhunu519944\",\"age\":519966,\"department\":\"Data Integration519944\",\"degree\":\"bsc519944\",\"workingLocation\":\"Jaipur519944\",\"maxEducation\":\"Graduation519944\",\"experience\":519946,\"jobRole\":\"Data Engineer519944\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000003","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00004-5c355575-1751-464f-93a5-672ae1f60c29-c000.snappy.parquet","partitionValues":{},"size":5990168,"modificationTime":1664788867000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":519946,\"Name\":\"Robin519945\",\"city\":\"Jhunjhunu519945\",\"age\":519967,\"department\":\"Data Integration519945\",\"degree\":\"bsc519945\",\"workingLocation\":\"Jaipur519945\",\"maxEducation\":\"Graduation519945\",\"experience\":519947,\"jobRole\":\"Data Engineer519945\"},\"maxValues\":{\"Id\":647843,\"Name\":\"Robin647842\",\"city\":\"Jhunjhunu647842\",\"age\":647864,\"department\":\"Data Integration647842\",\"degree\":\"bsc647842\",\"workingLocation\":\"Jaipur647842\",\"maxEducation\":\"Graduation647842\",\"experience\":647844,\"jobRole\":\"Data Engineer647842\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000004","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00005-fac91d9b-9528-4381-8983-419261e7b6ba-c000.snappy.parquet","partitionValues":{},"size":5962518,"modificationTime":1664788865000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":647844,\"Name\":\"Robin647843\",\"city\":\"Jhunjhunu647843\",\"age\":647865,\"department\":\"Data Integration647843\",\"degree\":\"bsc647843\",\"workingLocation\":\"Jaipur647843\",\"maxEducation\":\"Graduation647843\",\"experience\":647845,\"jobRole\":\"Data Engineer647843\"},\"maxValues\":{\"Id\":775741,\"Name\":\"Robin775740\",\"city\":\"Jhunjhunu775740\",\"age\":775762,\"department\":\"Data Integration775740\",\"degree\":\"bsc775740\",\"workingLocation\":\"Jaipur775740\",\"maxEducation\":\"Graduation775740\",\"experience\":775742,\"jobRole\":\"Data Engineer775740\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000005","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00006-41366707-d8cc-1                                                              QQQQQQQAQQQQQQQQQQ46d2-b781-4d9bcc5b5210-c000.snappy.parquet","partitionValues":{},"size":5967367,"modificationTime":1664788866000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":775742,\"Name\":\"Robin775741\",\"city\":\"Jhunjhunu775741\",\"age\":775763,\"department\":\"Data Integration775741\",\"degree\":\"bsc775741\",\"workingLocation\":\"Jaipur775741\",\"maxEducation\":\"Graduation775741\",\"experience\":775743,\"jobRole\":\"Data Engineer775741\"},\"maxValues\":{\"Id\":903639,\"Name\":\"Robin903638\",\"city\":\"Jhunjhunu903638\",\"age\":903660,\"department\":\"Data Integration903638\",\"degree\":\"bsc903638\",\"workingLocation\":\"Jaipur903638\",\"maxEducation\":\"Graduation903638\",\"experience\":903640,\"jobRole\":\"Data Engineer903638\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000006","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00007-8c0a3670-cb24-4cdc-ac37-a9f173c7fe71-c000.snappy.parquet","partitionValues":{},"size":4498552,"modificationTime":1664788864000,"dataChange":true,"stats":"{\"numRecords\":96361,\"minValues\":{\"Id\":903640,\"Name\":\"Robin903639\",\"city\":\"Jhunjhunu903639\",\"age\":903661,\"department\":\"Data Integration903639\",\"degree\":\"bsc903639\",\"workingLocation\":\"Jaipur903639\",\"maxEducation\":\"Graduation903639\",\"experience\":903641,\"jobRole\":\"Data Engineer903639\"},\"maxValues\":{\"Id\":1000000,\"Name\":\"Robin999999\",\"city\":\"Jhunjhunu999999\",\"age\":1000021,\"department\":\"Data Integration999999\",\"degree\":\"bsc999999\",\"workingLocation\":\"Jaipur999999\",\"maxEducation\":\"Graduation999999\",\"experience\":1000001,\"jobRole\":\"Data Engineer999999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000007","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1664788877620,"userId":"2102279527814428","userName":"[email protected]","operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"notebook":{"notebookId":"2204946593481188"},"clusterId":"1003-090126-rztrsnm4","isolationLevel":"WriteSerializable","isBlindAppend":false,"operationMetrics":{"numFiles":"8","numOutputRows":"1000000","numOutputBytes":"46647583"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"17e89e4f-380f-4888-9aaf-2ada42e13b8c"}}

Ultimately, I want to fetch the maximum value of experience using SQL

CodePudding user response:

Starting with the example "stats" column of type string which you have...

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("{\"numRecords\":136251,\"minValues\":{\"Id\":1,\"Name\":\"Robin\",\"city\":\"Jhunjhunu\",\"age\":22,\"department\":\"Data Integration\",\"degree\":\"bsc\",\"workingLocation\":\"Jaipur\",\"maxEducation\":\"Graduation\",\"experience\":2,\"jobRole\":\"Data Engineer\"},\"maxValues\":{\"Id\":136251,\"Name\":\"Robin99999\",\"city\":\"Jhunjhunu99999\",\"age\":136272,\"department\":\"Data Integration99999\",\"degree\":\"bsc99999\",\"workingLocation\":\"Jaipur99999\",\"maxEducation\":\"Graduation99999\",\"experience\":136252,\"jobRole\":\"Data Engineer99999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",),
     ("{\"numRecords\":127898,\"minValues\":{\"Id\":136252,\"Name\":\"Robin136251\",\"city\":\"Jhunjhunu136251\",\"age\":136273,\"department\":\"Data Integration136251\",\"degree\":\"bsc136251\",\"workingLocation\":\"Jaipur136251\",\"maxEducation\":\"Graduation136251\",\"experience\":136253,\"jobRole\":\"Data Engineer136251\"},\"maxValues\":{\"Id\":264149,\"Name\":\"Robin264148\",\"city\":\"Jhunjhunu264148\",\"age\":264170,\"department\":\"Data Integration264148\",\"degree\":\"bsc264148\",\"workingLocation\":\"Jaipur264148\",\"maxEducation\":\"Graduation264148\",\"experience\":264150,\"jobRole\":\"Data Engineer264148\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",),
     ("{\"numRecords\":127898,\"minValues\":{\"Id\":264150,\"Name\":\"Robin264149\",\"city\":\"Jhunjhunu264149\",\"age\":264171,\"department\":\"Data Integration264149\",\"degree\":\"bsc264149\",\"workingLocation\":\"Jaipur264149\",\"maxEducation\":\"Graduation264149\",\"experience\":264151,\"jobRole\":\"Data Engineer264149\"},\"maxValues\":{\"Id\":392047,\"Name\":\"Robin392046\",\"city\":\"Jhunjhunu392046\",\"age\":392068,\"department\":\"Data Integration392046\",\"degree\":\"bsc392046\",\"workingLocation\":\"Jaipur392046\",\"maxEducation\":\"Graduation392046\",\"experience\":392048,\"jobRole\":\"Data Engineer392046\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",)],
    ["stats"])

You can use from_json (providing schema path to the object that you need ("experience")) to extract that object together with the structure leading to the object. Then you can remove the structure by just telling the struct path to it (F.col("exp.maxValues.experience")).

df = df.withColumn("exp", F.from_json("stats", "maxValues struct<experience:long>"))
df = df.withColumn("exp", F.col("exp.maxValues.experience"))

df.show()
#  -------------------- ------ 
# |               stats|   exp|
#  -------------------- ------ 
# |{"numRecords":136...|136252|
# |{"numRecords":127...|264150|
# |{"numRecords":127...|392048|
#  -------------------- ------ 

In SQL you could do it like this:

spark.sql("""
    SELECT from_json(stats, 'maxValues struct<experience:long>').maxValues.experience as exp
    FROM df
""").show()
#  ------ 
# |   exp|
#  ------ 
# |136252|
# |264150|
# |392048|
#  ------ 
  • Related