Convert rows into columns in a PySpark dataframe


My current pyspark dataframe is like this:

Region  Location    Month       Services    Type    values_in_millions  values_in_percent
USA     USA         1/1/2021    ABC         DC      101537.553          34.775
Europe  Italy       2/1/2021    ABC         DC      434404.87           44.653
Europe  Spain       2/1/2021    ABC         DC      895057.332          21.925
Asia    India       3/1/2021    ABC         DC      211963.21           27.014

My desired dataframe should be in this form:

Region  Location    Month       Services    Type        key_1               values_1    values_2
USA     USA         1/1/2021    ABC         DC          values_in_millions  101537.553  
Europe  Italy       2/1/2021    ABC         DC          values_in_millions  434404.87   
Europe  Spain       2/1/2021    ABC         DC          values_in_millions  895057.332  
Asia    India       3/1/2021    ABC         DC          values_in_millions  211963.21
USA     USA         1/1/2021    ABC         DC          values_in_percent               34.775%
Europe  Italy       2/1/2021    ABC         DC          values_in_percent               44.653%
Europe  Spain       2/1/2021    ABC         DC          values_in_percent               21.925%
Asia    India       3/1/2021    ABC         DC          values_in_percent               27.014%

Any approaches will help.

CodePudding user response:

You can create two separate dataframes: df1 with key_1 = 'values_in_millions' and df2 with key_1 = 'values_in_percent'. Below I first select the required columns, hardcode the key_1 column, hardcode the empty values_1 / values_2 column, rename the value column, and finally reselect the columns so that both dataframes end up with the same column order.

from pyspark.sql.functions import lit

df1 = (df.select("Region", "Location", "Month", "Services", "Type", "values_in_millions")
         .withColumn("key_1", lit("values_in_millions")).withColumn("values_2", lit(""))
         .withColumnRenamed("values_in_millions", "values_1")
         .select("Region", "Location", "Month", "Services", "Type", "key_1", "values_1", "values_2"))
df2 = (df.select("Region", "Location", "Month", "Services", "Type", "values_in_percent")
         .withColumn("key_1", lit("values_in_percent")).withColumn("values_1", lit(""))
         .withColumnRenamed("values_in_percent", "values_2")
         .select("Region", "Location", "Month", "Services", "Type", "key_1", "values_1", "values_2"))

df1.show()
df2.show()

Output is below.

+------+--------+----------+--------+----+------------------+--------+--------+
|Region|Location|     Month|Services|Type|             key_1|values_1|values_2|
+------+--------+----------+--------+----+------------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|values_in_millions|  100000|        |
|   IND|     DLH|2001-01-01|     ABC|  DC|values_in_millions|  200000|        |
|   NYC|     NYC|2001-01-01|     ABC|  DC|values_in_millions|  300000|        |
|    UK|   WALES|2001-01-01|     ABC|  DC|values_in_millions|  400000|        |
+------+--------+----------+--------+----+------------------+--------+--------+

+------+--------+----------+--------+----+-----------------+--------+--------+
|Region|Location|     Month|Services|Type|            key_1|values_1|values_2|
+------+--------+----------+--------+----+-----------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|values_in_percent|        |      34|
|   IND|     DLH|2001-01-01|     ABC|  DC|values_in_percent|        |      35|
|   NYC|     NYC|2001-01-01|     ABC|  DC|values_in_percent|        |      36|
|    UK|   WALES|2001-01-01|     ABC|  DC|values_in_percent|        |      37|
+------+--------+----------+--------+----+-----------------+--------+--------+

Once both dataframes have their columns in the same order, I can union them.

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame
  
def unionAll(*dfs):
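    # Union any number of dataframes that all share the same column order and schema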
    return reduce(DataFrame.unionAll, dfs)
  
df3 = unionAll(df1, df2)
df3.show()

Output is below.

+------+--------+----------+--------+----+------------------+--------+--------+
|Region|Location|     Month|Services|Type|             key_1|values_1|values_2|
+------+--------+----------+--------+----+------------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|values_in_millions|  100000|        |
|   IND|     DLH|2001-01-01|     ABC|  DC|values_in_millions|  200000|        |
|   NYC|     NYC|2001-01-01|     ABC|  DC|values_in_millions|  300000|        |
|    UK|   WALES|2001-01-01|     ABC|  DC|values_in_millions|  400000|        |
|   USA|     USA|2001-01-01|     ABC|  DC| values_in_percent|        |      34|
|   IND|     DLH|2001-01-01|     ABC|  DC| values_in_percent|        |      35|
|   NYC|     NYC|2001-01-01|     ABC|  DC| values_in_percent|        |      36|
|    UK|   WALES|2001-01-01|     ABC|  DC| values_in_percent|        |      37|
+------+--------+----------+--------+----+------------------+--------+--------+
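
Note that DataFrame.unionAll (like union) matches columns by position, which is why both dataframes were reselected into the same column order before the union. If you would rather match columns by name, unionByName (available since Spark 2.3) does that directly:

# unionByName matches columns by name rather than by position
df3 = df1.unionByName(df2)
df3.show()

For completeness, the same unpivot can also be done in a single pass with Spark SQL's stack function, which avoids building and unioning two dataframes. A minimal sketch, assuming the column names from the question (values_in_millions and values_in_percent); rows where a key does not apply end up with null instead of an empty string:

from pyspark.sql.functions import expr, col, when

df3 = (df.select(
           "Region", "Location", "Month", "Services", "Type",
           # stack(2, ...) emits two rows per input row: one per key/value pair
           expr("stack(2, 'values_in_millions', values_in_millions, "
                "'values_in_percent', values_in_percent) as (key_1, value)"))
         .withColumn("values_1", when(col("key_1") == "values_in_millions", col("value")))
         .withColumn("values_2", when(col("key_1") == "values_in_percent", col("value")))
         .drop("value"))
df3.show()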