I need to convert this DataFrame to a dictionary:
ID | value-1 | value-2 | value-3 |
---|---|---|---|
1A | Approve | NULL | NULL |
2B | Approve | Approve | NULL |
3C | NULL | NULL | Approve |
Desired output:
{'1A': ['value-1'], '2B': ['value-1', 'value-2'], '3C': ['value-3']}
Notice that I am using the values of the first column of the DataFrame as the keys of the dictionary.
CodePudding user response:
You can use something like this, based on array and array_remove:
from pyspark.sql import functions as F

# The first column provides the keys of the dict;
# the remaining columns are scanned to compute the values
dict_key = df.columns[0]
entry_cols = df.columns[1:]

result = {
    r[dict_key]: r.dict_entry
    for r in (
        df
        .select(
            dict_key,
            F.array_remove(
                F.array(*[
                    # keep the column name where the value is 'Approve',
                    # otherwise emit the placeholder string 'NULL'
                    F.when(F.col(c) == 'Approve', F.lit(c)).otherwise('NULL')
                    for c in entry_cols
                ]),
                'NULL',  # drop the placeholders from each array
            ).alias('dict_entry')
        )
        .collect()
    )
}
This is the result:
{'1A': ['value-1'], '2B': ['value-1', 'value-2'], '3C': ['value-3']}
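If you are working with a plain pandas DataFrame rather than Spark, the same result can be built with an ordinary dict comprehension. This is a minimal sketch under that assumption; the sample data below is hypothetical, reconstructed from the table in the question:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data matching the question's table
df = pd.DataFrame({
    'ID': ['1A', '2B', '3C'],
    'value-1': ['Approve', 'Approve', np.nan],
    'value-2': [np.nan, 'Approve', np.nan],
    'value-3': [np.nan, np.nan, 'Approve'],
})

# First column supplies the keys; remaining columns supply the values
dict_key = df.columns[0]
entry_cols = df.columns[1:]

# Keep only the column names whose value is 'Approve' for each row
result = {
    row[dict_key]: [c for c in entry_cols if row[c] == 'Approve']
    for _, row in df.iterrows()
}
print(result)
# {'1A': ['value-1'], '2B': ['value-1', 'value-2'], '3C': ['value-3']}
```

Note that NaN comparisons with `== 'Approve'` evaluate to False, so missing values are filtered out automatically.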