Parse list of Json strings in a dataframe by converting the column into rows using pyspark?


# Input
df = [ 
      ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
      ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
       ]

# Desired output in a dataframe

id number
10 1.1
10 1.2
10 1.3
20 2.1
20 2.2
20 2.3

I tried withColumn with from_json, but that only splits the JSON into two columns:

df.withColumn("n",from_json(col("_1"),Sch)).select("n.*")

How can I get the 2nd column to split into rows and the first column to be repeated for each number in pyspark?

Any help would be greatly appreciated! TIA!

CodePudding user response:

You may use explode here, e.g.

from pyspark.sql import functions as F

df.withColumn("n",F.from_json(F.col("_1"),Sch))\
  .select("n.id",F.explode("n.number").alias("number"))

A full reproducible example is included below:


from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

sparkSession = SparkSession.builder.getOrCreate()

df = [ 
      ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
      ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
       ]

Sch = T.StructType([
    T.StructField("id",T.IntegerType(),True),
    T.StructField("number",T.ArrayType(T.StringType()),True)
])
df = sparkSession.createDataFrame(df)
df.show()

df.withColumn("n",F.from_json(F.col("_1"),Sch))\
  .select("n.id",F.explode("n.number").alias("number"))\
  .show(truncate=False)

outputs

+-------------------------------------------+
|_1                                         |
+-------------------------------------------+
|{"id":10, "number" : ["1.1", "1.2", "1.3"]}|
|{"id":20, "number" : ["2.1", "2.2", "2.3"]}|
+-------------------------------------------+

+----+------+
|n.id|number|
+----+------+
|10  |1.1   |
|10  |1.2   |
|10  |1.3   |
|20  |2.1   |
|20  |2.2   |
|20  |2.3   |
+----+------+

Let me know if this works for you.
