I have a DataFrame:
+--------+--------------------+
|      id|                 col|
+--------+--------------------+
|11111111|[{"orderId":11111...|
|22222222|[{"orderId":22222...|
+--------+--------------------+
The col column is a JSON string in this format:
[
{
"orderId": 11111,
"type": "111",
"rate":111,
"sum": 111
},
{
"orderId": 11111,
"type": "222",
"rate":222,
"sum": 222
}
]
I want to group by id, but I don't know how to pivot this DataFrame. The final result I want to see:
+--------+-------+-----+-----+-----+
|      id|orderId| type| rate| sum |
+--------+-------+-----+-----+-----+
|11111111|  11111|"111"|  111|  111|
|11111111|  11111|"222"|  222|  222|
|22222222|  22222|"222"|  222|  222|
+--------+-------+-----+-----+-----+
CodePudding user response:
I tried with the below approach and it works. Not sure if it's the best approach; I'm open for suggestions and improvements.
Input_df
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [(11111111, """[{"orderId": 11111,"type": "111","rate":111,"sum": 111} , {"orderId": 11111,"type": "222","rate":222,"sum": 222}]"""), (22222222, """[{"orderId": 22222,"type": "222","rate":222,"sum": 222}]""")]
schema = StructType( [ StructField("id", IntegerType()), StructField("col", StringType()) ] )
df = spark.createDataFrame(data=data,schema=schema)
df.show(truncate=False)
+--------+-----------------------------------------------------------------------------------------------------------------+
|id      |col                                                                                                              |
+--------+-----------------------------------------------------------------------------------------------------------------+
|11111111|[{"orderId": 11111,"type": "111","rate":111,"sum": 111} , {"orderId": 11111,"type": "222","rate":222,"sum": 222}]|
|22222222|[{"orderId": 22222,"type": "222","rate":222,"sum": 222}]                                                         |
+--------+-----------------------------------------------------------------------------------------------------------------+
# Splitting the list of dictionaries into rows
# (raw strings for the regex patterns avoid invalid-escape warnings in Python)
df2 = (df
       .withColumn("data", explode(split(regexp_replace(col("col"), r"(^\[)|(\]$)", ""), ", ")))
       .withColumn("data", explode(split("data", "},")))
       .withColumn("data", explode(split(regexp_replace(col("data"), r"(^\{)|(\}$)", ""), ", ")))
       .withColumn("data_new", regexp_replace(col("data"), "}", "")))
df2 = df2.drop(col("col")).drop(col("data"))
df2.show(truncate=False)
+--------+-----------------------------------------------------+
|id      |data_new                                             |
+--------+-----------------------------------------------------+
|11111111|"orderId": 11111,"type": "111","rate":111,"sum": 111 |
|11111111|"orderId": 11111,"type": "222","rate":222,"sum": 222 |
|22222222|"orderId": 22222,"type": "222","rate":222,"sum": 222 |
+--------+-----------------------------------------------------+
# Getting col_1..col_3 from the separated key-value pairs
col_1 = split(df2['data_new'],',')
df3 = (df2
       .withColumn("col_1", col_1.getItem(0))
       .withColumn("col_2", col_1.getItem(1))
       .withColumn("col_3", col_1.getItem(3))
       .drop("data_new"))
df3.show(truncate=False)
+--------+----------------+-------------+-----------+
|id      |col_1           |col_2        |col_3      |
+--------+----------------+-------------+-----------+
|11111111|"orderId": 11111|"type": "111"|"sum": 111 |
|11111111|"orderId": 11111|"type": "222"|"sum": 222 |
|22222222|"orderId": 22222|"type": "222"|"sum": 222 |
+--------+----------------+-------------+-----------+
# Splitting on ':' and getting value at index 1 for each of the 3 columns - orderId, type, sum
col_1 = split(df3['col_1'],':')
col_2 = split(df3['col_2'], ':')
col_3 = split(df3['col_3'], ':')
df4 = (df3
       .withColumn("orderId", col_1.getItem(1)).drop("col_1")
       .withColumn("type", col_2.getItem(1)).drop("col_2")
       .withColumn("sum", col_3.getItem(1)).drop("col_3"))
df4.show(truncate=False)
+--------+-------+------+-----+
|id      |orderId|type  |sum  |
+--------+-------+------+-----+
|11111111| 11111 | "111"| 111 |
|11111111| 11111 | "222"| 222 |
|22222222| 22222 | "222"| 222 |
+--------+-------+------+-----+
CodePudding user response:
Seems like it's not pivoting but extraction.
Just do
df.select('id', 'col.*')
(this assumes col has already been parsed into structs; with a raw JSON string you need from_json first).