Matching Spark DataFrame cell contents to column values in PySpark


I have the following Spark DataFrame:

data = [
  ["A", "false", "B", "true", "C", "false", "D", "false"],
  ["A", "false", "B", "false", "D", "true", "C", "false"],
  ["A", "false", "B", "false", "C", "false", "D", "false"],
  ["A", "true", "C", "true", "B", "false", "D", "false"]
]
columns = ["Label_1_name", "Label_1_value", "Label_2_name", "Label_2_value", "Label_3_name", "Label_3_value", "Label_4_name", "Label_4_value"]

df = spark.createDataFrame(data, columns)
df.show()

+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
|Label_1_name|Label_1_value|Label_2_name|Label_2_value|Label_3_name|Label_3_value|Label_4_name|Label_4_value|
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
|           A|        false|           B|         true|           C|        false|           D|        false|
|           A|        false|           B|        false|           D|         true|           C|        false|
|           A|        false|           B|        false|           C|        false|           D|        false|
|           A|         true|           C|         true|           B|        false|           D|        false|
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+

My objective is to transform this DataFrame into one with only four columns, named "A", "B", "C", and "D", whose values are 0 (for false) or 1 (for true), depending on which value corresponds to each column in that row.

The problem is that the data is dirty: "Label_1" doesn't always correspond to column "A", and likewise "Label_4" doesn't always correspond to column "D".

Here is the intended output:

+----+------+------+------+
|   A|     B|     C|     D|
+----+------+------+------+
|   0|     1|     0|     0|
|   0|     0|     0|     1|
|   0|     0|     0|     0|
|   1|     0|     1|     0|
+----+------+------+------+

CodePudding user response:

Hi, I think this should do the trick:

from pyspark.sql import functions as f, DataFrame

label_value_list = ["Label_1_name", "Label_1_value", "Label_2_name", "Label_2_value",
                    "Label_3_name", "Label_3_value", "Label_4_name", "Label_4_value"]


def create_map(input_df: DataFrame, label_column_list):
    # Collapse the alternating name/value columns of each row into a single
    # map column keyed by label name, e.g. {"A": "false", "B": "true", ...}.
    return input_df.withColumn("combMap", f.create_map(label_column_list))


def create_cols(letter_list):
    # For each letter, look up its value in the map and cast the "true"/"false"
    # string to boolean and then to int, yielding 1 or 0.
    for letter in letter_list:
        yield f.col(f"combMap.{letter}").cast("boolean").cast("int").alias(letter)


df_with_map = create_map(df, label_value_list)
final_cols = list(create_cols(["A", "B", "C", "D"]))
final_df = df_with_map.select(final_cols)

final_df.show(truncate=False)

output:

+---+---+---+---+
|A  |B  |C  |D  |
+---+---+---+---+
|0  |1  |0  |0  |
|0  |0  |0  |1  |
|0  |0  |0  |0  |
|1  |0  |1  |0  |
+---+---+---+---+
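For what it's worth, the dot-path lookup combMap.A is just map key access: f.create_map builds a per-row map from the alternating name/value columns, and each letter is then looked up as a key. If you prefer to make that explicit, the same columns can be built with getItem. This is only a sketch that reuses df_with_map from above:

alt_cols = [
    # getItem(letter) pulls the value stored under that key in the map column
    f.col("combMap").getItem(letter).cast("boolean").cast("int").alias(letter)
    for letter in ["A", "B", "C", "D"]
]
alt_df = df_with_map.select(alt_cols)
alt_df.show(truncate=False)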

Keep in mind that this does not guarantee row order. If for some reason you need A: 1,0,0,0 instead of A: 0,0,0,1, you will need an explicit ordering column, for example as sketched below.
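One way to do that (a minimal sketch; the column name "_row_order" is just an illustrative choice, and it assumes the generated ids reflect the input order, which holds for a small locally created DataFrame like this one but is not guaranteed for arbitrary partitioned data):

# Tag each row with a surrogate ordering id before the transform, keep it
# through the select, then sort on it and drop it.
df_tagged = df.withColumn("_row_order", f.monotonically_increasing_id())
df_tagged_map = create_map(df_tagged, label_value_list)
final_ordered_df = (
    df_tagged_map
    .select([f.col("_row_order")] + list(create_cols(["A", "B", "C", "D"])))
    .orderBy("_row_order")
    .drop("_row_order")
)
final_ordered_df.show()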
