How "explode" the data on a cell formatted as string so I can turn the keys into columns o-CodePudding

One of the datasets I'm working with has a column called json_data, which contains data like this:

{
    "eta": "",
    "eta_value": 0,
    "schedules": [{
        "open_time": "10:15:00",
        "close_time": "14:00:00"
    }, {
        "open_time": "18:00:00",
        "close_time": "20:00:00"
    }],
    "logo": "1617723892776.png",
    "score_v2": 0,
    "id": "900371722_8339714",
    "store_id": 900371722,
    "super_store_id": 900371722,
    "index": 375,
    "brand_name": "Carl's Restaurant",
    "store_type": "restaurant",
    "has_promise": false,
    "tags": [189],
    "background": "1618349497.jpg",
    "is_enabled": false,
    "friendly_url": {
        "store_id": 90037172
    }
}

This column is a string type, which means I cannot easily turn the info inside it into columns. And that's what brings me here: how can I turn the data here in columns? Specially with the nested data inside "schedules".

I'm having a hard time with this column.

CodePudding user response：

Few months back, I was also struggling with similar json structure. Glad you brought it up, helped refreshing my memory!
I followed the below steps to resolve -

Input Data

df1.show(truncate=False)

 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|jsondata                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|{     "eta": "",     "eta_value": 0,     "schedules": [{         "open_time": "10:15:00",         "close_time": "14:00:00"     }, {         "open_time": "18:00:00",         "close_time": "20:00:00"     }],     "logo": "1617723892776.png",     "score_v2": 0,     "id": "900371722_8339714",     "store_id": 900371722,     "super_store_id": 900371722,     "index": 375,     "brand_name": "Carl's Restaurant",     "store_type": "restaurant",     "has_promise": false,     "tags": [189],     "background": "1618349497.jpg",     "is_enabled": false,     "friendly_url": {         "store_id": 90037172     } }|
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Convert jsondata column to MapType as below -

from pyspark.sql.functions import *
from pyspark.sql.types import *

df2 = df1.withColumn("cols", from_json( "jsondata", MapType(StringType(), StringType()) )).drop("jsondata")
df2.show(truncate=False)

 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|cols                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|{eta -> , eta_value -> 0, schedules -> [{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}], logo -> 1617723892776.png, score_v2 -> 0, id -> 900371722_8339714, store_id -> 900371722, super_store_id -> 900371722, index -> 375, brand_name -> Carl's Restaurant, store_type -> restaurant, has_promise -> false, tags -> [189], background -> 1618349497.jpg, is_enabled -> false, friendly_url -> {"store_id":90037172}}|
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Now, column cols needs to be exploded as below -

df3= df2.select(explode("cols").alias("col_columns", "col_rows"))
df3.show(truncate=False)

 -------------- --------------------------------------------------------------------------------------------------- 
|col_columns   |col_rows                                                                                           |
 -------------- --------------------------------------------------------------------------------------------------- 
|eta           |                                                                                                   |
|eta_value     |0                                                                                                  |
|schedules     |[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|
|logo          |1617723892776.png                                                                                  |
|score_v2      |0                                                                                                  |
|id            |900371722_8339714                                                                                  |
|store_id      |900371722                                                                                          |
|super_store_id|900371722                                                                                          |
|index         |375                                                                                                |
|brand_name    |Carl's Restaurant                                                                                  |
|store_type    |restaurant                                                                                         |
|has_promise   |false                                                                                              |
|tags          |[189]                                                                                              |
|background    |1618349497.jpg                                                                                     |
|is_enabled    |false                                                                                              |
|friendly_url  |{"store_id":90037172}                                                                              |
 -------------- ---------------------------------------------------------------------------------------------------

Once, you have col_columns and col_rows as individual columns, all that is needed to do is pivot col_columns and aggregate it using its corresponding first col_rows as below -

df4 = (df3.groupBy()
         .pivot("col_columns")
         .agg(first("col_rows"))
)
df4.show(truncate=False)

Output

 -------------- ----------------- --- --------- --------------------- ----------- ----------------- ----- ---------- ----------------- --------------------------------------------------------------------------------------------------- -------- --------- ---------- -------------- ----- 
|background    |brand_name       |eta|eta_value|friendly_url         |has_promise|id               |index|is_enabled|logo             |schedules                                                                                          |score_v2|store_id |store_type|super_store_id|tags |
 -------------- ----------------- --- --------- --------------------- ----------- ----------------- ----- ---------- ----------------- --------------------------------------------------------------------------------------------------- -------- --------- ---------- -------------- ----- 
|1618349497.jpg|Carl's Restaurant|   |0        |{"store_id":90037172}|false      |900371722_8339714|375  |false     |1617723892776.png|[{"open_time":"10:15:00","close_time":"14:00:00"},{"open_time":"18:00:00","close_time":"20:00:00"}]|0       |900371722|restaurant|900371722     |[189]|
 -------------- ----------------- --- --------- --------------------- ----------- ----------------- ----- ---------- ----------------- --------------------------------------------------------------------------------------------------- -------- --------- ---------- -------------- -----

P.S. - If you want to explode the columns like schedules, friendly_url as well then you might have to repeat the above steps. Something as below -

df4 = df4.withColumn("schedule_json", from_json("schedules", ArrayType(MapType(StringType(),StringType()))))

df4.select(explode("schedule_json").alias("schedules")).select(explode("schedules")).show(truncate=False)

 ---------- -------- 
|key       |value   |
 ---------- -------- 
|open_time |10:15:00|
|close_time|14:00:00|
|open_time |18:00:00|
|close_time|20:00:00|
 ---------- --------