# Input
df = [
    ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
    ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
]
# Desired output as a DataFrame
| id | number |
|----|--------|
| 10 | 1.1    |
| 10 | 1.2    |
| 10 | 1.3    |
| 20 | 2.1    |
| 20 | 2.2    |
| 20 | 2.3    |
I tried `withColumn` but was only able to split the JSON into two columns:
df.withColumn("n",from_json(col("_1"),Sch)).select("n.*")
How can I split the second column into rows, repeating the first column's value for each number, in PySpark?
Any help would be greatly appreciated! TIA!
CodePudding user response:
You may use `explode` here, e.g.:
from pyspark.sql import functions as F
df.withColumn("n", F.from_json(F.col("_1"), Sch)) \
  .select("n.id", F.explode("n.number").alias("number"))
Full reproducible example included below:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# get or create the active Spark session
spark = SparkSession.builder.getOrCreate()

df = [
    ('{"id":10, "number" : ["1.1", "1.2", "1.3"]}',),
    ('{"id":20, "number" : ["2.1", "2.2", "2.3"]}',),
]
Sch = T.StructType([
    T.StructField("id", T.IntegerType(), True),
    T.StructField("number", T.ArrayType(T.StringType()), True),
])
df = spark.createDataFrame(df)
df.show()
df.withColumn("n", F.from_json(F.col("_1"), Sch)) \
  .select("n.id", F.explode("n.number").alias("number")) \
  .show(truncate=False)
outputs
+-------------------------------------------+
|_1                                         |
+-------------------------------------------+
|{"id":10, "number" : ["1.1", "1.2", "1.3"]}|
|{"id":20, "number" : ["2.1", "2.2", "2.3"]}|
+-------------------------------------------+
+----+------+
|n.id|number|
+----+------+
|10  |1.1   |
|10  |1.2   |
|10  |1.3   |
|20  |2.1   |
|20  |2.2   |
|20  |2.3   |
+----+------+
Let me know if this works for you.
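Unrelated to Spark itself, a quick way to sanity-check the expected rows is the same flatten in plain Python with the standard `json` module:

```python
import json

rows = [
    '{"id":10, "number" : ["1.1", "1.2", "1.3"]}',
    '{"id":20, "number" : ["2.1", "2.2", "2.3"]}',
]

# parse each JSON string, then emit one (id, number) pair per array element
flat = [
    (rec["id"], num)
    for rec in (json.loads(r) for r in rows)
    for num in rec["number"]
]
print(flat)
# → [(10, '1.1'), (10, '1.2'), (10, '1.3'), (20, '2.1'), (20, '2.2'), (20, '2.3')]
```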