Home > Software design >  PySpark DataFrame Converting Row values into column names
PySpark DataFrame Converting Row values into column names

Time:04-12

I have a Pyspark dataframe in the below format:

enter image description here

And I need to convert it into something like this:

enter image description here

Any help ?

CodePudding user response:

Create a new column row_num using row_number and then use pivot. See below logic for details -

Input Data

df = spark.createDataFrame(data = [("Name", "ABC"),
                                   ("Number", "889"),
                                   ("Zip", "99882"),
                                   ("Name", "DEF"), 
                                   ("Number", "998"),
                                   ("Zip", "99880")],
                           schema = ["Header", "Value"]
                          )
df.show()

 ------ ----- 
|Header|Value|
 ------ ----- 
|  Name|  ABC|
|Number|  889|
|   Zip|99882|
|  Name|  DEF|
|Number|  998|
|   Zip|99880|
 ------ ----- 

Now create a new column as row_num using row_number function.

from pyspark.sql.functions import *
from pyspark.sql import Window

df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("Header").orderBy("Value")))

Finally, groupBy this newly created column and use pivot on Header column.

df1.groupBy("row_num").pivot("Header").agg(first("Value")).drop("row_num").show()

 ---- ------ ----- 
|Name|Number|  Zip|
 ---- ------ ----- 
| ABC|   889|99880|
| DEF|   998|99882|
 ---- ------ ----- 
  • Related