Fetch data using column names stored as values in a column within the same dataframe pyspark-CodePudding

I have a large dataset similar to example below :

ID	CODE	STUDY	AMOUNT	COL_NAME
111	5611	ABCD	56.17	ID
211	5411	GFED	451.1	AMOUNT
311	3212	YTRA	687.3	STUDY

I want to populate the values of the columns stored in col_name in a column(COL_VAL) within the same dataframe as below :

ID	CODE	STUDY	AMOUNT	COL_NAME	COL_VALUE
111	5611	ABCD	56.17	ID	111
211	5411	GFED	451.1	AMOUNT	451.1
311	3212	YTRA	687.3	STUDY	YTRA

I am using a loop and .collect() to populate values but it is taking a lot of time. Would like to know efficient ways to do same relevant for a large dataset.

CodePudding user response：

You can use when to achieve this.

from pyspark.sql import functions as F

data = [(111, 5611, "ABCD", 56.17, "ID",),
(211, 5411, "GFED", 451.1, "AMOUNT",),
(311, 3212, "YTRA", 687.3, "STUDY",),]

df = spark.createDataFrame(data, ("ID", "CODE", "STUDY", "AMOUNT","COL_NAME"))

def derive_column_value():
    condition = F
    for possible_value in df.columns:
        condition = condition.when(F.col("COL_NAME") == possible_value, F.col(possible_value))
    return condition

df.withColumn("COL_VALUE", derive_column_value()).show()

Output

 --- ---- ----- ------ -------- --------- 
| ID|CODE|STUDY|AMOUNT|COL_NAME|COL_VALUE|
 --- ---- ----- ------ -------- --------- 
|111|5611| ABCD| 56.17|      ID|      111|
|211|5411| GFED| 451.1|  AMOUNT|    451.1|
|311|3212| YTRA| 687.3|   STUDY|     YTRA|
 --- ---- ----- ------ -------- ---------