I am having a column of below json string in my dataframe how can I explode/flatten it to get single level dataframe ? Currently the schema is
df
|-json_data (StringType)
how can I have a df of below schema ?
df
|-key1
|-key2_signal
|-key2_value
[{
"key1": 1647336730000,
"key2": [
{
"signal": "signal_key_1",
"value": 73.6
},
{
"signal": "signal_key_2",
"value": 3.375
},
{
"signal": "signal_key_3",
"value": 13.82
}]
}]
CodePudding user response:
I guess it's how you should do it.
Explanation:
- Creating the dummy for practice on your sample data. (df1 creation)
- Take the string column and cast it to be an actual JSON.
- Select all of the fields.
- Explode key2 since it's a list.
- select all the relevant keys from within key2
- select key1 and key2 related data. using asterisk (*)
import pyspark.sql.functions as f
from pyspark.sql.types import *
import json
df1 = spark.createDataFrame(
[{'json_data': [json.dumps({
"key1": 1647336730000,
"key2": [
{
"signal": "signal_key_1",
"value": 73.6
},
{
"signal": "signal_key_2",
"value": 3.375
},
{
"signal": "signal_key_3",
"value": 13.82
}]
})]}],
schema=StructType([StructField('json_data', StringType(), True)])
)
(
df1
.withColumn('actual', f.from_json(f.col('json_data'), f.schema_of_json(f.lit(df1.select(f.col("json_data")).first()[0]))))
.withColumn('data', f.explode('actual'))
.drop('actual')
.withColumn('key2', f.explode('data.key2'))
.drop('json_data')
.select('data.key1', 'key2.*').show()
)
# ------------- ------------ -----
# | key1| signal|value|
# ------------- ------------ -----
# |1647336730000|signal_key_1| 73.6|
# |1647336730000|signal_key_2|3.375|
# |1647336730000|signal_key_3|13.82|
# ------------- ------------ -----