One of the datasets I'm working with has a column called json_data, which contains data like this:
{
"eta": "",
"eta_value": 0,
"schedules": [{
"open_time": "10:15:00",
"close_time": "14:00:00"
}, {
"open_time": "18:00:00",
"close_time": "20:00:00"
}],
"logo": "1617723892776.png",
"score_v2": 0,
"id": "900371722_8339714",
"store_id": 900371722,
"super_store_id": 900371722,
"index": 375,
"brand_name": "Carl's Restaurant",
"store_type": "restaurant",
"has_promise": false,
"tags": [189],
"background": "1618349497.jpg",
"is_enabled": false,
"friendly_url": {
"store_id": 90037172
}
}
This column is a string type, which means I cannot easily turn the info inside it into columns. And that's what brings me here: how can I turn the data here in columns? Specially with the nested data inside "schedules".
I'm having a hard time with this column.
CodePudding user response:
Few months back, I was also struggling with similar json
structure. Glad you brought it up, helped refreshing my memory!
I followed the below steps to resolve -
Input Data
df1.show(truncate=False)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|jsondata |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|{ "eta": "", "eta_value": 0, "schedules": [{ "open_time": "10:15:00", "close_time": "14:00:00" }, { "open_time": "18:00:00", "close_time": "20:00:00" }], "logo": "1617723892776.png", "score_v2": 0, "id": "900371722_8339714", "store_id": 900371722, "super_store_id": 900371722, "index": 375, "brand_name": "Carl's Restaurant", "store_type": "restaurant", "has_promise": false, "tags": [189], "background": "1618349497.jpg", "is_enabled": false, "friendly_url": { "store_id": 90037172 } }|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Convert jsondata
column to MapType
as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df2 = df1.withColumn("cols", from_json( "jsondata", MapType(StringType(), StringType()) )).drop("jsondata")
df2.show(truncate=False)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|cols |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|{eta -> , eta_value -> 0, schedules -> [{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}], logo -> 1617723892776.png, score_v2 -> 0, id -> 900371722_8339714, store_id -> 900371722, super_store_id -> 900371722, index -> 375, brand_name -> Carl's Restaurant, store_type -> restaurant, has_promise -> false, tags -> [189], background -> 1618349497.jpg, is_enabled -> false, friendly_url -> {"store_id":90037172}}|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Now, column cols
needs to be exploded as below -
df3= df2.select(explode("cols").alias("col_columns", "col_rows"))
df3.show(truncate=False)
-------------- ---------------------------------------------------------------------------------------------------
|col_columns |col_rows |
-------------- ---------------------------------------------------------------------------------------------------
|eta | |
|eta_value |0 |
|schedules |[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|
|logo |1617723892776.png |
|score_v2 |0 |
|id |900371722_8339714 |
|store_id |900371722 |
|super_store_id|900371722 |
|index |375 |
|brand_name |Carl's Restaurant |
|store_type |restaurant |
|has_promise |false |
|tags |[189] |
|background |1618349497.jpg |
|is_enabled |false |
|friendly_url |{"store_id":90037172} |
-------------- ---------------------------------------------------------------------------------------------------
Once, you have col_columns
and col_rows
as individual columns, all that is needed to do is pivot col_columns
and aggregate it using its corresponding first col_rows
as below -
df4 = (df3.groupBy()
.pivot("col_columns")
.agg(first("col_rows"))
)
df4.show(truncate=False)
Output
-------------- ----------------- --- --------- --------------------- ----------- ----------------- ----- ---------- ----------------- --------------------------------------------------------------------------------------------------- -------- --------- ---------- -------------- -----
|background |brand_name |eta|eta_value|friendly_url |has_promise|id |index|is_enabled|logo |schedules |score_v2|store_id |store_type|super_store_id|tags |
-------------- ----------------- --- --------- --------------------- ----------- ----------------- ----- ---------- ----------------- --------------------------------------------------------------------------------------------------- -------- --------- ---------- -------------- -----
|1618349497.jpg|Carl's Restaurant| |0 |{"store_id":90037172}|false |900371722_8339714|375 |false |1617723892776.png|[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|0 |900371722|restaurant|900371722 |[189]|
-------------- ----------------- --- --------- --------------------- ----------- ----------------- ----- ---------- ----------------- --------------------------------------------------------------------------------------------------- -------- --------- ---------- -------------- -----
P.S. - If you want to explode the columns like schedules
, friendly_url
as well then you might have to repeat the above steps. Something as below -
df4 = df4.withColumn("schedule_json", from_json("schedules", ArrayType(MapType(StringType(),StringType()))))
df4.select(explode("schedule_json").alias("schedules")).select(explode("schedules")).show(truncate=False)
---------- --------
|key |value |
---------- --------
|open_time |10:15:00|
|close_time|14:00:00|
|open_time |18:00:00|
|close_time|20:00:00|
---------- --------