I have a Spark DataFrame like this:
data = [
["A", "false", "B", "true", "C", "false", "D", "false"],
["A", "false", "B", "false", "D", "true", "C", "false"],
["A", "false", "B", "false", "C", "false", "D", "false"],
["A", "true", "C", "true", "B", "false", "D", "false"]
]
columns = ["Label_1_name", "Label_1_value", "Label_2_name", "Label_2_value", "Label_3_name", "Label_3_value", "Label_4_name", "Label_4_value"]
df = spark.createDataFrame(data, columns)
df.show()
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
|Label_1_name|Label_1_value|Label_2_name|Label_2_value|Label_3_name|Label_3_value|Label_4_name|Label_4_value|
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
|           A|        false|           B|         true|           C|        false|           D|        false|
|           A|        false|           B|        false|           D|         true|           C|        false|
|           A|        false|           B|        false|           C|        false|           D|        false|
|           A|         true|           C|         true|           B|        false|           D|        false|
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
My objective is to transform this DataFrame into one with only four columns, named "A", "B", "C", and "D", whose values are 0 (for false) or 1 (for true), depending on which value corresponds to that column.
The problem is that the data is dirty: "Label_1" doesn't necessarily correspond to column "A", and likewise "Label_4" doesn't necessarily correspond to column "D".
Here is the intended Spark DataFrame output:
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  0|  1|  0|  0|
|  0|  0|  0|  1|
|  0|  0|  0|  0|
|  1|  0|  1|  0|
+---+---+---+---+
CodePudding user response:
Hi, I think this should do the trick:
from pyspark.sql import functions as f, DataFrame

label_value_list = ["Label_1_name", "Label_1_value", "Label_2_name", "Label_2_value",
                    "Label_3_name", "Label_3_value", "Label_4_name", "Label_4_value"]

def create_map(input_df: DataFrame, label_column_list):
    # Collect the alternating name/value columns into a single map column,
    # e.g. {"A": "false", "B": "true", ...} for each row, regardless of
    # which Label_i slot each letter landed in.
    return input_df.withColumn("combMap", f.create_map(label_column_list))

def create_cols(letter_list):
    # Look each letter up in the map and cast "true"/"false" to 1/0.
    for letter in letter_list:
        yield f.col(f"combMap.{letter}").cast("boolean").cast("int").alias(letter)

df_with_map = create_map(df, label_value_list)
final_cols = list(create_cols(["A", "B", "C", "D"]))
final_df = df_with_map.select(final_cols)
final_df.show(truncate=False)
output:
+---+---+---+---+
|A  |B  |C  |D  |
+---+---+---+---+
|0  |1  |0  |0  |
|0  |0  |0  |1  |
|0  |0  |0  |0  |
|1  |0  |1  |0  |
+---+---+---+---+
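Per row, what `create_map` plus the double cast computes is equivalent to this plain-Python pairing (the helper name `row_to_flags` is illustrative, not part of the Spark solution):

```python
def row_to_flags(row, letters=("A", "B", "C", "D")):
    # row alternates name, value, name, value, ...
    # Pair each name with its value, then turn "true"/"false" into 1/0
    # for each target letter, wherever that letter appeared in the row.
    pairs = dict(zip(row[0::2], row[1::2]))
    return {letter: int(pairs[letter] == "true") for letter in letters}

print(row_to_flags(["A", "false", "B", "true", "C", "false", "D", "false"]))
# {'A': 0, 'B': 1, 'C': 0, 'D': 0}
```

This is why the approach tolerates the dirty data: the lookup is by name, not by column position.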
Keep in mind that this will not guarantee row order, since Spark makes no ordering guarantees across transformations. If for some reason you need A: 1, 0, 0, 0
instead of A: 0, 0, 0, 1,
you will need an explicit ordering column.