I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
|                data|            includes|                meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
The two columns I'm interested in here are data and includes. For data, I ran the following:
from pyspark.sql import functions as F

df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
|          author_id|          created_at|                 id|public_metrics|                text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312|  {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777|  {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
This is something I can work with. However, I can't do the same with the includes column, since it's a struct (the {}) wrapping the array (the []) rather than an array itself.
Is there a way for me to deal with this using PySpark?
EDIT:
If you were to look at the includes section in the JSON file, it looks like:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally, in the first table in my question, I'd want includes to be users, or at least be able to drill down to users.
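A quick way to confirm that nesting is to print the inferred schema: data should come back as an array of structs (which is why explode works on it directly), while includes should be a struct whose users field holds the array.

df.printSchema()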
CodePudding user response:
As your includes column is a struct with a users field holding the array, you can use .getItem() to get the array by the field name and then explode it, that is:
df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")
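To sanity-check the result, show a couple of rows; the columns you get depend on the fields inside each user object in your file:

df3.show(2)

Equivalently, since includes is a struct, dot notation reaches the nested array directly:

df3 = df.withColumn("includes", F.explode(F.col("includes.users"))).select("includes.*")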