Dealing with nested JSON in PySpark


I'm loading a JSON file into PySpark:

df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
|                data|            includes|                meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+

The two columns I'm interested in here are data and includes. For data, I ran the following:

import pyspark.sql.functions as F

df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
|          author_id|          created_at|                 id|public_metrics|                text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312|  {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777|  {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+

That's something I can work with. However, I can't do the same with the includes column, since it has {} enclosing the [].

Is there a way for me to deal with this using PySpark?

EDIT:

If you look at the includes section in the JSON file, it looks like this:

"includes": {"users": [{"id": "893899303" .... }, ...]},

So ideally, in the first table in my question, I'd want includes to be users, or at least be able to drill down to users.
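
A quick way to confirm how Spark inferred that nesting is to print the schema of the column (a sketch; the exact user fields depend on your file):

df.select("includes").printSchema()

which should print something along the lines of:

root
 |-- includes: struct (nullable = true)
 |    |-- users: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- ...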

CodePudding user response:

As your includes column is a struct keyed by "users" (Spark infers JSON objects as structs), you can use .getItem() to get the array by that key and then explode it:

df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")
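
For reference, a minimal sketch of the whole pipeline under the assumptions above (the file name and the includes.users field come from the question; the exact user columns depend on your file):

import pyspark.sql.functions as F

df = spark.read.json("20220824211022.json")

# Tweets: explode the array in `data` and flatten each struct into columns
tweets = df.withColumn("data", F.explode(F.col("data"))).select("data.*")

# Users: drill into the `users` array inside the `includes` struct,
# explode it, and flatten each user struct into columns
users = (
    df.withColumn("includes", F.explode(F.col("includes").getItem("users")))
      .select("includes.*")
)

users.show(2, truncate=False)

Dot notation such as F.col("includes.users") should work just as well here, since Spark resolves nested struct fields with the same syntax.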