My current PySpark dataframe looks like this:
Region  Location  Month     Services  Type  values_in_millions  values_in_percent
USA     USA       1/1/2021  ABC       DC    101537.553          34.775
Europe  Italy     2/1/2021  ABC       DC    434404.87           44.653
Europe  Spain     2/1/2021  ABC       DC    895057.332          21.925
Asia    India     3/1/2021  ABC       DC    211963.21           27.014
My desired dataframe should be in this form:
Region  Location  Month     Services  Type  key_1               values_1    values_2
USA     USA       1/1/2021  ABC       DC    values_in_millions  101537.553
Europe  Italy     2/1/2021  ABC       DC    values_in_millions  434404.87
Europe  Spain     2/1/2021  ABC       DC    values_in_millions  895057.332
Asia    India     3/1/2021  ABC       DC    values_in_millions  211963.21
USA     USA       1/1/2021  ABC       DC    values_in_percent               34.775%
Europe  Italy     2/1/2021  ABC       DC    values_in_percent               44.653%
Europe  Spain     2/1/2021  ABC       DC    values_in_percent               21.925%
Asia    India     3/1/2021  ABC       DC    values_in_percent               27.014%
Any approach would help.
CodePudding user response:
You can create two separate dataframes: df1 with key_1 = 'values_in_millions' and df2 with key_1 = 'values_in_percent'. What I'm doing below is first selecting the required columns, hardcoding the key_1 column with a literal, filling the unused values column with an empty string, renaming the source column to values_1 (or values_2), and finally reselecting the columns so that both dataframes end up in the same order.
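For reference, here is a minimal sketch of how the sample dataframe from the question could be constructed, so the snippets below run end to end (it assumes an active SparkSession named spark; this setup is not part of the original answer):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the question's sample data as df
df = spark.createDataFrame(
    [
        ("USA", "USA", "1/1/2021", "ABC", "DC", 101537.553, 34.775),
        ("Europe", "Italy", "2/1/2021", "ABC", "DC", 434404.87, 44.653),
        ("Europe", "Spain", "2/1/2021", "ABC", "DC", 895057.332, 21.925),
        ("Asia", "India", "3/1/2021", "ABC", "DC", 211963.21, 27.014),
    ],
    ["Region", "Location", "Month", "Services", "Type",
     "values_in_millions", "values_in_percent"],
)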
from pyspark.sql.functions import lit

df1 = (
    df.select("Region", "Location", "Month", "Services", "Type", "values_in_millions")
      .withColumn("key_1", lit("values_in_millions"))      # hardcode the key
      .withColumn("values_2", lit(""))                     # empty placeholder column
      .withColumnRenamed("values_in_millions", "values_1")
      .select("Region", "Location", "Month", "Services", "Type",
              "key_1", "values_1", "values_2")
)

df2 = (
    df.select("Region", "Location", "Month", "Services", "Type", "values_in_percent")
      .withColumn("key_1", lit("values_in_percent"))       # hardcode the key
      .withColumn("values_1", lit(""))                     # empty placeholder column
      .withColumnRenamed("values_in_percent", "values_2")
      .select("Region", "Location", "Month", "Services", "Type",
              "key_1", "values_1", "values_2")
)
df1.show()
df2.show()
Output is below.
+------+--------+----------+--------+----+------------------+--------+--------+
|Region|Location|     Month|Services|Type|             key_1|values_1|values_2|
+------+--------+----------+--------+----+------------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|values_in_millions|  100000|        |
|   IND|     DLH|2001-01-01|     ABC|  DC|values_in_millions|  200000|        |
|   NYC|     NYC|2001-01-01|     ABC|  DC|values_in_millions|  300000|        |
|    UK|   WALES|2001-01-01|     ABC|  DC|values_in_millions|  400000|        |
+------+--------+----------+--------+----+------------------+--------+--------+
+------+--------+----------+--------+----+-----------------+--------+--------+
|Region|Location|     Month|Services|Type|            key_1|values_1|values_2|
+------+--------+----------+--------+----+-----------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|values_in_percent|        |      34|
|   IND|     DLH|2001-01-01|     ABC|  DC|values_in_percent|        |      35|
|   NYC|     NYC|2001-01-01|     ABC|  DC|values_in_percent|        |      36|
|    UK|   WALES|2001-01-01|     ABC|  DC|values_in_percent|        |      37|
+------+--------+----------+--------+----+-----------------+--------+--------+
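As a side note (not part of the original answer): the desired output above shows a literal % sign on the percent values, so values_2 in df2 could be built with concat instead of a plain rename. A minimal sketch:
from pyspark.sql.functions import col, concat, lit

df2 = (
    df.select("Region", "Location", "Month", "Services", "Type", "values_in_percent")
      .withColumn("key_1", lit("values_in_percent"))
      .withColumn("values_1", lit(""))
      # append a literal % to the percent value
      .withColumn("values_2", concat(col("values_in_percent").cast("string"), lit("%")))
      .select("Region", "Location", "Month", "Services", "Type",
              "key_1", "values_1", "values_2")
)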
Once the columns are arranged in the same order, the two dataframes can be unioned.
from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Fold unionAll over all dataframes; column order must match.
    return reduce(DataFrame.unionAll, dfs)

df3 = unionAll(df1, df2)
df3.show()
Output is below.
+------+--------+----------+--------+----+------------------+--------+--------+
|Region|Location|     Month|Services|Type|             key_1|values_1|values_2|
+------+--------+----------+--------+----+------------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|values_in_millions|  100000|        |
|   IND|     DLH|2001-01-01|     ABC|  DC|values_in_millions|  200000|        |
|   NYC|     NYC|2001-01-01|     ABC|  DC|values_in_millions|  300000|        |
|    UK|   WALES|2001-01-01|     ABC|  DC|values_in_millions|  400000|        |
|   USA|     USA|2001-01-01|     ABC|  DC| values_in_percent|        |      34|
|   IND|     DLH|2001-01-01|     ABC|  DC| values_in_percent|        |      35|
|   NYC|     NYC|2001-01-01|     ABC|  DC| values_in_percent|        |      36|
|    UK|   WALES|2001-01-01|     ABC|  DC| values_in_percent|        |      37|
+------+--------+----------+--------+----+------------------+--------+--------+
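As a final note, Spark can also do this reshaping in a single pass with the SQL stack expression, avoiding the intermediate dataframes and the union entirely. A minimal sketch under the same column names (the empty strings preserve the values_1/values_2 layout above; the row order will interleave by key rather than group all millions rows first):
from pyspark.sql.functions import expr

df3 = df.select(
    "Region", "Location", "Month", "Services", "Type",
    # stack(2, ...) emits two rows per input row: one per source column
    expr(
        "stack(2, "
        "'values_in_millions', cast(values_in_millions as string), '', "
        "'values_in_percent', '', cast(values_in_percent as string)"
        ") as (key_1, values_1, values_2)"
    ),
)
df3.show()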