Requirements:
I want to create a new dataframe out of a single column of an existing dataframe. That column's value is a JSON list of objects.
Problem:
Since the JSON does not have a fixed schema, I wasn't able to use the from_json function, since it needs a schema up front to parse the columns.
Example:

| Column A | Column B                     |
|----------|------------------------------|
| 1        | [{"id":"123","phone":"124"}] |
| 3        | [{"id":"456","phone":"741"}] |
Expected output:

| id  | phone |
|-----|-------|
| 123 | 124   |
| 456 | 741   |
Any thoughts on this?
CodePudding user response:
Try using Spark SQL to explode the "Column B" array:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

spark = SparkSession.builder.appName("Test_app").getOrCreate()

input_data = [
    (1, [{"id": "123", "phone": "124"}]),
    (3, [{"id": "456", "phone": "741"}]),
]

schema = StructType([
    StructField("Column A", IntegerType(), True),
    StructField("Column B", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("phone", StringType(), True),
    ])), True),
])

df = spark.createDataFrame(input_data, schema)

# Backticks are needed because the column names contain spaces.
# explode() produces one row per array element; the struct fields
# are then flattened with a dotted select.
df_exploded = df.selectExpr("`Column A`", "explode(`Column B`) as e") \
    .select("e.id", "e.phone")
df_exploded.show()
```
Output is below:

```
+---+-----+
| id|phone|
+---+-----+
|123|  124|
|456|  741|
+---+-----+
```
CodePudding user response:
Convert it into an RDD and then read it as JSON. For testing, I have removed the id element in the second row.
```python
import json

input_data = [
    (1, [{"id": "123", "phone": "124"}]),
    (3, [{"phone": "741"}]),
]
df = spark.createDataFrame(input_data, ["ColA", "ColB"])

# Serialize each list back to a JSON string so spark.read.json can
# parse it and infer a merged schema across all rows
spark.read.json(df.rdd.map(lambda r: json.dumps(r.ColB))).show()
```
```
+----+-----+
|  id|phone|
+----+-----+
| 123|  124|
|null|  741|
+----+-----+
```