I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
|                data|            includes|                meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
The two columns I'm interested in here are data and includes. For data, I ran the following:
from pyspark.sql import functions as F

df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
|          author_id|          created_at|                 id|public_metrics|                text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312|  {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777|  {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
This is something I can work with. However, I can't do the same with the includes column, since it's a struct (the {}) wrapping the array (the []) rather than an array itself.
Is there a way for me to deal with this using PySpark?
EDIT:
If you were to look at the includes section in the JSON file, it looks like:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally, in the first table in my question, I'd want includes to be users, or at least be able to drill down to users.
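A quick way to confirm that nesting is to print the inferred schema: data should come back as an array of structs (which is why explode works on it directly), while includes should be a struct whose users field holds the array.

df.printSchema()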
CodePudding user response:
As your includes column is a struct with a users field holding the array, you can use .getItem() to get the array by the field name and then explode it, that is:
df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")
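To sanity-check the result, show a couple of rows; the columns you get depend on the fields inside each user object in your file:

df3.show(2)

Equivalently, since includes is a struct, dot notation reaches the nested array directly:

df3 = df.withColumn("includes", F.explode(F.col("includes.users"))).select("includes.*")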