Parse list of Json strings in a dataframe by converting the column into rows using pyspark?


# Input
df = [ 
      ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
      ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
       ]

# Desired output in a dataframe

id number
10 1.1
10 1.2
10 1.3
20 2.1
20 2.2
20 2.3

I tried withColumn with from_json, but that only splits the JSON into two columns:

df.withColumn("n",from_json(col("_1"),Sch)).select("n.*")

How can I get the 2nd column to split into rows and the first column to be repeated for each number in pyspark?

Any help would be greatly appreciated! TIA!

CodePudding user response:

You may use explode here, e.g.

from pyspark.sql import functions as F

df.withColumn("n",F.from_json(F.col("_1"),Sch))\
  .select("n.id",F.explode("n.number").alias("number"))

A full reproducible example is included below:


from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

sparkSession = SparkSession.builder.getOrCreate()

df = [ 
      ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
      ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
       ]

Sch = T.StructType([
    T.StructField("id",T.IntegerType(),True),
    T.StructField("number",T.ArrayType(T.StringType()),True)
])
df = sparkSession.createDataFrame(df)
df.show()

df.withColumn("n",F.from_json(F.col("_1"),Sch))\
  .select("n.id",F.explode("n.number").alias("number"))\
  .show(truncate=False)

outputs

+-------------------------------------------+
|_1                                         |
+-------------------------------------------+
|{"id":10, "number" : ["1.1", "1.2", "1.3"]}|
|{"id":20, "number" : ["2.1", "2.2", "2.3"]}|
+-------------------------------------------+

+----+------+
|n.id|number|
+----+------+
|10  |1.1   |
|10  |1.2   |
|10  |1.3   |
|20  |2.1   |
|20  |2.2   |
|20  |2.3   |
+----+------+

Let me know if this works for you.
