How to transpose data in PySpark using a dynamic function


The input DataFrame will have multiple columns and multiple rows.

+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
|Banana |1000  |USA    |
|Carrots|1500  |USA    |
|Beans  |1600  |USA    |
|Orange |2000  |USA    |
|Orange |2000  |USA    |
|Banana |400   |China  |
|Carrots|1200  |China  |
|Beans  |1500  |China  |
|Orange |4000  |China  |
|Banana |2000  |Canada |
|Carrots|2000  |Canada |
|Beans  |2000  |Mexico |
+-------+------+-------+

For example, an input with 5 columns and 3 rows should produce an output with 3 rows and 5 columns.

I am expecting that I need a dynamic function that takes a DataFrame as input and automatically transposes the data; we just need to pass that DataFrame to the function.

CodePudding user response:

Within PySpark this would be the .pivot() function, with the parameter being the column you want to pivot by.

This data sample looks like the same one used at https://sparkbyexamples.com/pyspark/pyspark-pivot-and-unpivot-dataframe, which showcases how to do what you're asking for.

Edit: In order to use the .pivot() function you first need to aggregate the DataFrame, as shown in the linked example.
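
A minimal, self-contained sketch of that groupBy-then-pivot approach on the sample data above (the SparkSession setup and the df variable name are assumptions for illustration, not from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-example").getOrCreate()

# Recreate the sample data from the question
data = [
    ("Banana", 1000, "USA"), ("Carrots", 1500, "USA"),
    ("Beans", 1600, "USA"), ("Orange", 2000, "USA"),
    ("Orange", 2000, "USA"), ("Banana", 400, "China"),
    ("Carrots", 1200, "China"), ("Beans", 1500, "China"),
    ("Orange", 4000, "China"), ("Banana", 2000, "Canada"),
    ("Carrots", 2000, "Canada"), ("Beans", 2000, "Mexico"),
]
df = spark.createDataFrame(data, ["Product", "Amount", "Country"])

# Aggregate (groupBy) first, then pivot on the column whose distinct
# values become the new column headers, then aggregate Amount per cell.
df.groupBy("Product").pivot("Country").sum("Amount").show()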

CodePudding user response:

df.groupBy("Product").pivot("Country").sum("Amount").show()
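
Since the question asks for a dynamic function, here is a minimal sketch of a reusable wrapper around the line above (the function name and parameters are illustrative, not part of the original answer):

def transpose_by_pivot(df, group_col, pivot_col, value_col):
    # Group by the column that stays as rows, turn the distinct values
    # of pivot_col into new columns, and sum value_col into the cells.
    return df.groupBy(group_col).pivot(pivot_col).sum(value_col)

# Usage: pass in the DataFrame and the relevant column names
transpose_by_pivot(df, "Product", "Country", "Amount").show()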