my data frame looks like:
| categoryName | catA | catB |
|---|---|---|
| catA | 0.25 | 0.75 |
| catB | 0.5 | 0.7 |
Where categoryName has String type and the cat* columns are Double. I would like to add a column that contains the value from the column whose name is stored in the categoryName column:
| categoryName | catA | catB | score |
|---|---|---|---|
| catA | 0.25 | 0.75 | 0.25 |
| catB | 0.5 | 0.7 | 0.7 |
In the first row, 'score' takes the value from the column named 'catA'; in the second row, 'score' takes the value from the column named 'catB'. Thank you
CodePudding user response:
One way is to create a map of column names to values for each row, and then access the map with the name stored in the categoryName column.
What's cool about this is that it works for as many columns as you want.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

data = [
    {"categoryName": "catA", "catA": 0.25, "catB": 0.75},
    {"categoryName": "catB", "catA": 0.5, "catB": 0.7},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)

df = (
    df.withColumn(
        # map('catA', catA, 'catB', catB, ...) built from every column name/value pair
        "map", F.expr("map(" + ",".join([f"'{c}', {c}" for c in df.columns]) + ")")
    )
    # look the map up with the key stored in categoryName
    .withColumn("score", F.expr("map[categoryName]"))
    .drop("map")
)

df.show(truncate=False)
Result:
+----+----+------------+-----+
|catA|catB|categoryName|score|
+----+----+------------+-----+
|0.25|0.75|catA        |0.25 |
|0.5 |0.7 |catB        |0.7  |
+----+----+------------+-----+
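If you prefer to stay in the DataFrame API instead of building a SQL expression string, the same map-lookup idea can be sketched with F.create_map and bracket indexing on the map column. The names score_cols and mapping below are just illustrative, and only the cat* columns are put into the map so the looked-up value keeps its Double type:

from itertools import chain
import pyspark.sql.functions as F

# Pair each category column name (as a literal) with its value and build a map column
score_cols = [c for c in df.columns if c != "categoryName"]
mapping = F.create_map(*chain.from_iterable([F.lit(c), F.col(c)] for c in score_cols))

# Index the map with the key held in categoryName
df = df.withColumn("score", mapping[F.col("categoryName")])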
CodePudding user response:
Although Vlad Siv's answer is way better than mine (as he mentioned, his method handles any additional columns you might add to the pyspark dataframe without having to modify the code), I am going to leave the simpler, hardcoded method here in case you want to understand how F.when and F.withColumn work through a more basic example.
import pyspark.sql.functions as F

# Assuming that your pyspark dataframe is named: sdf
sdf = sdf.withColumn(
    "score",
    F.when(
        F.col("categoryName") == "catA", F.col("catA")
    ).otherwise(F.when(
        F.col("categoryName") == "catB", F.col("catB")
    ).otherwise("Unknown Column")),
)
As I mentioned previously, if you add another category column to your pyspark dataframe, you will have to modify this code and add a branch for the new column after the last otherwise so that it is also handled.
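For illustration, if a hypothetical catC column were added, the hardcoded version would need an extra branch, roughly like this:

sdf = sdf.withColumn(
    "score",
    F.when(
        F.col("categoryName") == "catA", F.col("catA")
    ).otherwise(F.when(
        F.col("categoryName") == "catB", F.col("catB")
    ).otherwise(F.when(
        F.col("categoryName") == "catC", F.col("catC")  # branch for the hypothetical catC column
    ).otherwise("Unknown Column"))),
)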