Requirements:
I want to create a new dataframe out of a single column of an existing dataframe. That column's value is a JSON list of objects.
Problem:
Since the JSON does not have a fixed schema, I wasn't able to use the from_json function, since it needs a schema up front to parse the columns.
Example:

| Column A | Column B                     |
|----------|------------------------------|
| 1        | [{"id":"123","phone":"124"}] |
| 3        | [{"id":"456","phone":"741"}] |
Expected output:

| id  | phone |
|-----|-------|
| 123 | 124   |
| 456 | 741   |
Any thoughts on this?
CodePudding user response:
Try using Spark SQL to explode the "Column B" array:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

spark = SparkSession.builder.appName("Test_app").getOrCreate()

input_data = [
    (1, [{"id": "123", "phone": "124"}]),
    (3, [{"id": "456", "phone": "741"}]),
]

schema = StructType([
    StructField("Column A", IntegerType(), True),
    StructField("Column B", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("phone", StringType(), True),
    ])), True),
])

df = spark.createDataFrame(input_data, schema)

# Backticks are needed because the column names contain spaces.
# explode() produces one row per array element; the struct fields
# are then flattened with a dotted select.
df_exploded = df.selectExpr("`Column A`", "explode(`Column B`) as e") \
    .select("e.id", "e.phone")
df_exploded.show()
```
Output is below:

```
+---+-----+
| id|phone|
+---+-----+
|123|  124|
|456|  741|
+---+-----+
```
CodePudding user response:
Convert it into an RDD and then read it as JSON. For testing, I have removed the id element in the second row.
```python
import json

input_data = [
    (1, [{"id": "123", "phone": "124"}]),
    (3, [{"phone": "741"}]),
]
df = spark.createDataFrame(input_data, ["ColA", "ColB"])

# Serialize each list back to a JSON string so spark.read.json can
# parse it and infer a merged schema across all rows
spark.read.json(df.rdd.map(lambda r: json.dumps(r.ColB))).show()
```
```
+----+-----+
|  id|phone|
+----+-----+
| 123|  124|
|null|  741|
+----+-----+
```