Interpreting dataframe out of Array JSON in Spark Structured Streaming-CodePudding

I am having a column of below json string in my dataframe how can I explode/flatten it to get single level dataframe ? Currently the schema is

df
|-json_data (StringType)

how can I have a df of below schema ?

df
|-key1
|-key2_signal
|-key2_value

[{
    "key1": 1647336730000,
    "key2": [
      {
        "signal": "signal_key_1",
        "value": 73.6
    },
      {
        "signal": "signal_key_2",
        "value": 3.375
    },
      {
        "signal": "signal_key_3",
        "value": 13.82
    }]
}]

CodePudding user response：

I guess it's how you should do it.

Explanation:

Creating the dummy for practice on your sample data. (df1 creation)
Take the string column and cast it to be an actual JSON.
Select all of the fields.
Explode key2 since it's a list.
select all the relevant keys from within key2
select key1 and key2 related data. using asterisk (*)

import pyspark.sql.functions as f
from pyspark.sql.types import *
import json
df1 = spark.createDataFrame(
    [{'json_data': [json.dumps({
        "key1": 1647336730000,
        "key2": [
            {
                "signal": "signal_key_1",
                "value": 73.6
            },
            {
                "signal": "signal_key_2",
                "value": 3.375
            },
            {
                "signal": "signal_key_3",
                "value": 13.82
            }]
    })]}],
    schema=StructType([StructField('json_data', StringType(), True)])
)
(
    df1
    .withColumn('actual', f.from_json(f.col('json_data'), f.schema_of_json(f.lit(df1.select(f.col("json_data")).first()[0]))))
    .withColumn('data', f.explode('actual'))
    .drop('actual')
    .withColumn('key2', f.explode('data.key2'))
    .drop('json_data')
    .select('data.key1', 'key2.*').show()
    )
#  ------------- ------------ ----- 
# |         key1|      signal|value|
#  ------------- ------------ ----- 
# |1647336730000|signal_key_1| 73.6|
# |1647336730000|signal_key_2|3.375|
# |1647336730000|signal_key_3|13.82|
#  ------------- ------------ -----