Fetch data using column names stored as values in a column within the same dataframe pyspark

Time:11-25

I have a large dataset similar to the example below:

ID   CODE  STUDY  AMOUNT  COL_NAME
111  5611  ABCD   56.17   ID
211  5411  GFED   451.1   AMOUNT
311  3212  YTRA   687.3   STUDY

I want to populate a new column (COL_VALUE) in the same dataframe with the value of whichever column is named in COL_NAME, as below:

ID   CODE  STUDY  AMOUNT  COL_NAME  COL_VALUE
111  5611  ABCD   56.17   ID        111
211  5411  GFED   451.1   AMOUNT    451.1
311  3212  YTRA   687.3   STUDY     YTRA
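
The requirement above can be stated as a small plain-Python reference model. This is only a sketch for checking expected values on a handful of rows, not a Spark solution; the helper name `lookup_col_value` and the choice to return None for a name that matches no column are assumptions made for illustration.

```python
# Plain-Python reference model of the desired lookup (no Spark involved).
# `lookup_col_value` is a hypothetical helper name used only for illustration.

def lookup_col_value(row):
    """Return the value of the column named in row["COL_NAME"].

    Assumption: a name that matches no column yields None.
    """
    return row.get(row["COL_NAME"])


rows = [
    {"ID": 111, "CODE": 5611, "STUDY": "ABCD", "AMOUNT": 56.17, "COL_NAME": "ID"},
    {"ID": 211, "CODE": 5411, "STUDY": "GFED", "AMOUNT": 451.1, "COL_NAME": "AMOUNT"},
    {"ID": 311, "CODE": 3212, "STUDY": "YTRA", "AMOUNT": 687.3, "COL_NAME": "STUDY"},
]

# Build new rows with COL_VALUE added, leaving the input rows unchanged.
result = [{**row, "COL_VALUE": lookup_col_value(row)} for row in rows]

for row in result:
    print(row["COL_NAME"], "->", row["COL_VALUE"])
```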

I am currently using a loop with .collect() to populate the values, but it takes a long time. I would like to know an efficient way to do this on a large dataset.

CodePudding user response:

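You can chain when() conditions, one per column, to achieve this. Each branch compares COL_NAME with a column name and returns that column's value; rows whose COL_NAME matches no column get null.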

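from pyspark.sql import SparkSession, functions as F

# Reuses the active session if one already exists (e.g. in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

data = [(111, 5611, "ABCD", 56.17, "ID"),
        (211, 5411, "GFED", 451.1, "AMOUNT"),
        (311, 3212, "YTRA", 687.3, "STUDY")]

df = spark.createDataFrame(data, ("ID", "CODE", "STUDY", "AMOUNT", "COL_NAME"))

def derive_column_value(df):
    # Start from the functions module so the first call is F.when();
    # every later call chains Column.when() onto the expression so far.
    condition = F
    for possible_value in df.columns:
        # Cast each branch to string so all branches share one explicit type.
        condition = condition.when(F.col("COL_NAME") == possible_value,
                                   F.col(possible_value).cast("string"))
    return condition

df.withColumn("COL_VALUE", derive_column_value(df)).show()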

Output

+---+----+-----+------+--------+---------+
| ID|CODE|STUDY|AMOUNT|COL_NAME|COL_VALUE|
+---+----+-----+------+--------+---------+
|111|5611| ABCD| 56.17|      ID|      111|
|211|5411| GFED| 451.1|  AMOUNT|    451.1|
|311|3212| YTRA| 687.3|   STUDY|     YTRA|
+---+----+-----+------+--------+---------+
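
The same chain of conditions can also be written as a single Spark SQL CASE expression. The sketch below only builds the expression text, so it runs without a Spark session. `build_case_expr` is a hypothetical helper, the cast to STRING is an assumption made so every branch returns the same type, and column names are restricted to simple identifiers to sidestep quoting rules.

```python
import re

# Assumption: only plain identifiers are supported, so no escaping is needed.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_case_expr(columns, name_col="COL_NAME"):
    """Build the text of a CASE expression that selects, per row,
    the column whose name is stored in `name_col`."""
    names = list(columns)
    if not names:
        raise ValueError("at least one column is required")
    for name in names + [name_col]:
        if not _SIMPLE_NAME.fullmatch(name):
            raise ValueError(f"unsupported column name: {name!r}")
    # One WHEN branch per candidate column; no ELSE, so unmatched rows get NULL.
    branches = [
        f"WHEN `{name_col}` = '{name}' THEN CAST(`{name}` AS STRING)"
        for name in names
    ]
    return "CASE " + " ".join(branches) + " END"


expr = build_case_expr(["ID", "CODE", "STUDY", "AMOUNT"])
print(expr)
```

With the dataframe from the answer, the text would be passed to F.expr, for example df.withColumn("COL_VALUE", F.expr(build_case_expr(df.columns))). Like the chained when() calls, this describes one expression that Spark evaluates per row, so no .collect() is involved.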