Home > other >  Pyspark flatten Json value inside column
Pyspark flatten Json value inside column

Time:02-01

I have data frame as below

Json_column
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{"coordinates":[null,null,null,null,null],"datetime":[1642602463000,1642600679000,1642598301000,1642598232000,1642596529000],"followers_count":[568,5037,76,4325,107]}
{"coordinates":[null,null,null,null,null],"datetime":[1641919643000,1641918112000,1641918082000,1641917719000,1641916830000],"followers_count":[233,63,99,750,186]}
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------

need to flatten this data frame as below

 --------------------------------------------------------------------------- ------------------------------------------------------------------------------- -------------------------- 

|datetime                                                                   |coordinates                                                                |followers_count       |

 --------------------------------------------------------------------------- ------------------------------------------------------------------------------- -------------------------- 

|1642602463000                                  | null                                  | 568              |

|1642600679000                                  | null                                      | 5037             |

|1642598301000                                  | null                                  | 76               |

|1642598232000                                  | null                                      | 4325             |

|1642596529000                                  | null                                      | 107              |

|1642602463000                                  | null                                  | 233              |

|1641918112000                                  | null                                      | 63               |

|1641918082000                                  | null                                  | 99               |

|1641917719000                                  | null                                      | 750              |

|1641916830000                                  | null                                      | 186              |

 --------------------------------------------------------------------------- ------------------------------------------------------------------------------- -------------------------- 

I tried this code

df.withColumn("datetime",F.get_json_object(col("Json_column"),"$.datetime")).
withColumn("coordinates",F.get_json_object(col("Json_column"),"$.coordinates")).
withColumn("followers_count",F.get_json_object(col("Json_column"),"$.followers_count"))
.select('datetime','followers_count','coordinates')

but it return array list not flatten data

CodePudding user response:

Use from_json to parse the json strings into struct type, then arrays_zip the arrays fields inside the struct and explode the result:

from pyspark.sql import functions as F

result = df.withColumn(
    "Json_column",
    F.from_json(
        "Json_column",
        "struct<coordinates:array<string>,datetime:array<long>,followers_count:array<int>>"
    )
).withColumn(
    "Json_column",
    F.arrays_zip("Json_column.datetime", "Json_column.coordinates", "Json_column.followers_count")
).selectExpr(
    "inline(Json_column)"
)

result.show()
# ------------- ----------- --------------- 
#|datetime     |coordinates|followers_count|
# ------------- ----------- --------------- 
#|1642602463000|null       |568            |
#|1642600679000|null       |5037           |
#|1642598301000|null       |76             |
#|1642598232000|null       |4325           |
#|1642596529000|null       |107            |
#|1641919643000|null       |233            |
#|1641918112000|null       |63             |
#|1641918082000|null       |99             |
#|1641917719000|null       |750            |
#|1641916830000|null       |186            |
# ------------- ----------- --------------- 
  •  Tags:  
  • Related