I have a dataframe with 2 columns, ID and Input_array (the values are JSON arrays):
ID  Input_array
1   [{"A": 300, "B": 400}, {"A": 500, "B": 600}]
2   [{"A": 800, "B": 900}]
The output I need is something like this:
ID  A    B
1   300  400
1   500  600
2   800  900
CodePudding user response:
You can remove the square brackets with the regexp_replace or substring functions. Then you can turn the string holding multiple JSON objects into an array with the split function. Then you can unwrap that array, producing a new row for each element, with the explode function. Finally, you can parse each JSON string into columns with the from_json function.
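Putting those steps together, here is a minimal sketch (assuming Input_array is a string column; the split pattern that matches the comma between "},{" pairs and the 'A INT, B INT' schema are assumptions about your data):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
     ('2', '[{"A": 800, "B": 900}]')],
    ['id', 'Input_array']
)

result = (
    df
    # 1. strip the outer square brackets
    .withColumn('no_brackets', F.regexp_replace('Input_array', r'^\[|\]$', ''))
    # 2. split into one JSON string per object (split on the comma between } and {)
    .withColumn('objects', F.split('no_brackets', r'(?<=\})\s*,\s*(?=\{)'))
    # 3. one row per JSON object
    .withColumn('obj', F.explode('objects'))
    # 4. parse each object into a struct, then expand its fields to columns
    .withColumn('parsed', F.from_json('obj', 'A INT, B INT'))
    .select('id', 'parsed.A', 'parsed.B')
)
result.show()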
CodePudding user response:
If Input_array is a string, you need to parse it as JSON, explode it into rows, and expand the keys into columns. You can describe the array with an ArrayType schema:
from pyspark.sql.types import *
from pyspark.sql import functions as F

data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]

# schema for an array of {A, B} structs
my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)

# parse the string, explode to one row per struct, expand the struct fields
df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))
df.show(truncate=False)
# +---+---+---+
# |id |A  |B  |
# +---+---+---+
# |1  |300|400|
# |1  |500|600|
# |2  |800|900|
# +---+---+---+
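As a side note, on Spark 3.4+ the explode plus star-expansion pair can be collapsed into one step with F.inline, which turns an array of structs directly into rows with one column per field (same data and my_schema as above; on older versions F.expr('inline(Input_array)') does the same):

df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select('id', F.inline('Input_array'))
df.show(truncate=False)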