I have a dataframe with 2 columns, ID and Input_array (the values are JSON arrays):
ID  Input_array
1   [{"A": 300, "B": 400}, {"A": 500, "B": 600}]
2   [{"A": 800, "B": 900}]
The output I need is something like this:
ID  A    B
1   300  400
1   500  600
2   800  900
CodePudding user response:
You can remove the square brackets with the regexp_replace or substring functions. Then you can turn the string holding multiple JSON objects into an array with the split function. Then you can unwrap that array, producing a new row for each element, with the explode function. Finally, you can parse each JSON string into columns with the from_json function.
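Putting those steps together, here is a minimal sketch (assuming Input_array is a string column; the split pattern that matches the comma between "},{" pairs and the 'A INT, B INT' schema are assumptions about your data):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
     ('2', '[{"A": 800, "B": 900}]')],
    ['id', 'Input_array']
)

result = (
    df
    # 1. strip the outer square brackets
    .withColumn('no_brackets', F.regexp_replace('Input_array', r'^\[|\]$', ''))
    # 2. split into one JSON string per object (split on the comma between } and {)
    .withColumn('objects', F.split('no_brackets', r'(?<=\})\s*,\s*(?=\{)'))
    # 3. one row per JSON object
    .withColumn('obj', F.explode('objects'))
    # 4. parse each object into a struct, then expand its fields to columns
    .withColumn('parsed', F.from_json('obj', 'A INT, B INT'))
    .select('id', 'parsed.A', 'parsed.B')
)
result.show()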
CodePudding user response:
If Input_array is a string, you need to parse it as JSON, explode it into rows, and expand the keys into columns. You can describe the array with an ArrayType schema:
from pyspark.sql.types import *
from pyspark.sql import functions as F

data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]

# schema for an array of {A, B} structs
my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)

# parse the string, explode to one row per struct, expand the struct fields
df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))
df.show(truncate=False)
# +---+---+---+
# |id |A  |B  |
# +---+---+---+
# |1  |300|400|
# |1  |500|600|
# |2  |800|900|
# +---+---+---+
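As a side note, on Spark 3.4+ the explode plus star-expansion pair can be collapsed into one step with F.inline, which turns an array of structs directly into rows with one column per field (same data and my_schema as above; on older versions F.expr('inline(Input_array)') does the same):

df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select('id', F.inline('Input_array'))
df.show(truncate=False)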