I have a large dataset similar to the example below:
| ID  | CODE | STUDY | AMOUNT | COL_NAME |
|-----|------|-------|--------|----------|
| 111 | 5611 | ABCD  | 56.17  | ID       |
| 211 | 5411 | GFED  | 451.1  | AMOUNT   |
| 311 | 3212 | YTRA  | 687.3  | STUDY    |
I want to populate a new column (COL_VALUE) in the same dataframe with the value of the column whose name is stored in COL_NAME, as below:
| ID  | CODE | STUDY | AMOUNT | COL_NAME | COL_VALUE |
|-----|------|-------|--------|----------|-----------|
| 111 | 5611 | ABCD  | 56.17  | ID       | 111       |
| 211 | 5411 | GFED  | 451.1  | AMOUNT   | 451.1     |
| 311 | 3212 | YTRA  | 687.3  | STUDY    | YTRA      |
I am currently using a loop and .collect() to populate the values, but it takes a lot of time. I would like to know a more efficient way to do this on a large dataset.
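Roughly, my current approach looks like this (simplified sketch, not my exact code):
rows = df.collect()  # pulls every row to the driver
# look up each row's value using the column name stored in COL_NAME
col_values = [row[row["COL_NAME"]] for row in rows]
# ...then rebuild the dataframe with the extra column, which does not scale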
CodePudding user response:
You can chain when conditions over the column names to achieve this.
from pyspark.sql import functions as F

data = [(111, 5611, "ABCD", 56.17, "ID"),
        (211, 5411, "GFED", 451.1, "AMOUNT"),
        (311, 3212, "YTRA", 687.3, "STUDY")]

df = spark.createDataFrame(data, ("ID", "CODE", "STUDY", "AMOUNT", "COL_NAME"))

def derive_column_value():
    # Start from the functions module so the first call is F.when(...);
    # every later iteration chains .when(...) on the resulting Column.
    condition = F
    for possible_value in df.columns:
        # If COL_NAME holds this column's name, return that column's value.
        condition = condition.when(F.col("COL_NAME") == possible_value, F.col(possible_value))
    return condition

df.withColumn("COL_VALUE", derive_column_value()).show()
Output
+---+----+-----+------+--------+---------+
| ID|CODE|STUDY|AMOUNT|COL_NAME|COL_VALUE|
+---+----+-----+------+--------+---------+
|111|5611| ABCD| 56.17|      ID|      111|
|211|5411| GFED| 451.1|  AMOUNT|    451.1|
|311|3212| YTRA| 687.3|   STUDY|     YTRA|
+---+----+-----+------+--------+---------+
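If there are many columns, an equivalent approach is to build a map from column name to value and look the value up by COL_NAME. This is a sketch, assuming Spark 2.4+ (for element_at) and that a string-typed COL_VALUE is acceptable, since all map values must share one type (the chained when above also coerces its branches to a common string type):
from itertools import chain
from pyspark.sql import functions as F

# Map column {"ID" -> ID, "CODE" -> CODE, ...}; values are cast to string
# because a map's values must all have the same type.
name_to_value = F.create_map(*chain.from_iterable(
    (F.lit(c), F.col(c).cast("string")) for c in df.columns if c != "COL_NAME"
))

# Look up the value keyed by whatever name COL_NAME holds.
df.withColumn("COL_VALUE", F.element_at(name_to_value, F.col("COL_NAME"))).show()
Both versions are evaluated entirely inside Spark, so nothing is collected to the driver.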