I have a PySpark dataframe in the below format.
I am trying to create a PySpark dataframe in the below format.
I tried doing this by splitting the source dataframe into three by filtering on the ORDER value (1, 2 and 3 are present for every key), then joining them back together with a left join on the key, and I was able to get the dataframe in the required format.
But is there a better way to solve this in PySpark?
Thanks in advance!
CodePudding user response:
Your ORDER1, ORDER2 and ORDER3 columns are a bit redundant. If you can live without them, you can combine a groupBy and pivot like so:
from pyspark.sql.functions import collect_list
df = spark.createDataFrame(
    [
        (1, 1, "ABC"),
        (1, 2, "BCD"),
        (1, 3, "DEF"),
        (2, 1, "EFG"),
        (2, 2, "GHI"),
        (2, 3, "HIJ"),
        (3, 1, "EFG"),
        (3, 2, "BCD"),
        (3, 3, "HIJ"),
    ],
    ["KEY", "ORDER", "RESP"]
)
output = df.groupBy("KEY").pivot("ORDER").agg(collect_list("RESP"))
output.show()
+---+-----+-----+-----+
|KEY|    1|    2|    3|
+---+-----+-----+-----+
|  1|[ABC]|[BCD]|[DEF]|
|  2|[EFG]|[GHI]|[HIJ]|
|  3|[EFG]|[BCD]|[HIJ]|
+---+-----+-----+-----+
I'm using collect_list to make sure that if you have more than 1 RESP value for a specific combination of KEY and ORDER, you don't lose it; it just gets put inside an array. So all the values you're seeing in this dataframe are arrays (with 1 element each for your example).
If you're sure you only have 1 value per combination of KEY and ORDER and you don't want arrays as values, you can also use first() instead of collect_list():
from pyspark.sql.functions import first
...
output = df.groupBy("KEY").pivot("ORDER").agg(first("RESP"))
output.show()
+---+---+---+---+
|KEY|  1|  2|  3|
+---+---+---+---+
|  1|ABC|BCD|DEF|
|  2|EFG|GHI|HIJ|
|  3|EFG|BCD|HIJ|
+---+---+---+---+