How to create pyspark dataframe pivoting by column names

I have a DataFrame:

+--------+--------------------+
|      id|                 col|
+--------+--------------------+
|11111111|[{"orderId":11111...|
|22222222|[{"orderId":22222...|
+--------+--------------------+

The col column is JSON in the following format:

[
                {
                    "orderId": 11111,
                    "type": "111",
                    "rate":111,
                    "sum": 111
                },
                {
                    "orderId": 11111,
                    "type": "222",
                    "rate":222,
                    "sum": 222
                }
]

I want to group by id, but I don't know how to pivot this DataFrame. This is the final result I want to see:

+--------+-------+-----+-----+-----+
|      id|orderId| type| rate| sum |
+--------+-------+-----+-----+-----+
|11111111|  11111|"111"|  111|  111|
|11111111|  11111|"222"|  222|  222|
|22222222|  22222|"222"|  222|  222|
+--------+-------+-----+-----+-----+

CodePudding user response:

I tried the approach below and it works. I'm not sure if it's the best approach; I'm open to suggestions and improvements.

Input_df

from pyspark.sql.functions import *
from pyspark.sql.types import *

data = [
    (11111111, """[{"orderId": 11111,"type": "111","rate":111,"sum": 111} , {"orderId": 11111,"type": "222","rate":222,"sum": 222}]"""),
    (22222222, """[{"orderId": 22222,"type": "222","rate":222,"sum": 222}]"""),
]

schema = StructType([StructField("id", IntegerType()), StructField("col", StringType())])

df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

+--------+-----------------------------------------------------------------------------------------------------------------+
|id      |col                                                                                                              |
+--------+-----------------------------------------------------------------------------------------------------------------+
|11111111|[{"orderId": 11111,"type": "111","rate":111,"sum": 111} , {"orderId": 11111,"type": "222","rate":222,"sum": 222}]|
|22222222|[{"orderId": 22222,"type": "222","rate":222,"sum": 222}]                                                         |
+--------+-----------------------------------------------------------------------------------------------------------------+
# Splitting list of dictionaries into rows

df2 = (df.withColumn("data", explode(split(regexp_replace(col("col"), r"(^\[)|(\]$)", ""), ", ")))  # drop the outer [ ], one row per dictionary
         .withColumn("data", explode(split("data", "},")))
         .withColumn("data", explode(split(regexp_replace(col("data"), r"(^\{)|(\}$)", ""), ", ")))  # drop the outer { }
         .withColumn("data_new", regexp_replace(col("data"), "}", "")))  # remove any leftover }

df2 = df2.drop(col("col")).drop(col("data"))
df2.show(truncate=False)

+--------+-----------------------------------------------------+
|id      |data_new                                             |
+--------+-----------------------------------------------------+
|11111111|"orderId": 11111,"type": "111","rate":111,"sum": 111 |
|11111111|"orderId": 11111,"type": "222","rate":222,"sum": 222 |
|22222222|"orderId": 22222,"type": "222","rate":222,"sum": 222 |
+--------+-----------------------------------------------------+
# Getting col_1, col_2 and col_3 from the separated key-value pairs

col_1 = split(df2['data_new'], ',')
# items 0, 1 and 3 hold the orderId, type and sum pairs (item 2, rate, is skipped here)
df3 = (df2.withColumn('col_1', col_1.getItem(0))
          .withColumn("col_2", col_1.getItem(1))
          .withColumn("col_3", col_1.getItem(3))
          .drop("data_new"))
df3.show(truncate=False)

+--------+----------------+-------------+-----------+
|id      |col_1           |col_2        |col_3      |
+--------+----------------+-------------+-----------+
|11111111|"orderId": 11111|"type": "111"|"sum": 111 |
|11111111|"orderId": 11111|"type": "222"|"sum": 222 |
|22222222|"orderId": 22222|"type": "222"|"sum": 222 |
+--------+----------------+-------------+-----------+
# Splitting on ':' and getting value at index 1 for each of the 3 columns - orderId, type, sum

col_1 = split(df3['col_1'],':')
col_2 = split(df3['col_2'], ':')
col_3 = split(df3['col_3'], ':')

df4 = (df3.withColumn('orderId', col_1.getItem(1)).drop(df3.col_1)
          .withColumn("type", col_2.getItem(1)).drop(df3.col_2)
          .withColumn("sum", col_3.getItem(1)).drop(df3.col_3))
df4.show(truncate=False)

+--------+-------+------+-----+
|id      |orderId|type  |sum  |
+--------+-------+------+-----+
|11111111| 11111 | "111"| 111 |
|11111111| 11111 | "222"| 222 |
|22222222| 22222 | "222"| 222 |
+--------+-------+------+-----+
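
The orderId, type and sum values above are still strings with stray quotes and leading spaces. A possible final clean-up, not part of the original answer and only a sketch (it assumes the wildcard imports above and that numeric types are wanted), would be:

# Hypothetical clean-up: strip quotes/whitespace and cast the numeric columns
df5 = (df4.withColumn("orderId", trim(col("orderId")).cast("long"))
          .withColumn("type", regexp_replace(trim(col("type")), '"', ''))
          .withColumn("sum", trim(col("sum")).cast("long")))
df5.show(truncate=False)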

CodePudding user response:

It seems like this is not pivoting but extraction. Just do df.select('id', 'col.*').
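
Since col in the original DataFrame is a plain JSON string, that select only works after the string is parsed into an array of structs and exploded. Here is a minimal sketch of that idea; the item_schema name and the field types are assumptions based on the sample data in the question, not part of this answer:

from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Schema assumed from the sample JSON shown in the question
item_schema = ArrayType(StructType([
    StructField("orderId", IntegerType()),
    StructField("type", StringType()),
    StructField("rate", IntegerType()),
    StructField("sum", IntegerType()),
]))

result = (df.withColumn("col", from_json(col("col"), item_schema))  # parse the JSON string into an array of structs
            .withColumn("col", explode(col("col")))                 # one row per order
            .select("id", "col.*"))                                 # flatten the struct into orderId, type, rate, sum
result.show()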
