I need to convert this DataFrame to a dictionary:
ID | value-1 | value-2 | value-3 |
---|---|---|---|
1A | Approve | NULL | NULL |
2B | Approve | Approve | NULL |
3C | NULL | NULL | Approve |
Desired output:
{'1A': ['value-1'], '2B': ['value-1', 'value-2'], '3C': ['value-3']}
Notice that I am using the values of the first column of the DataFrame as the keys of the dictionary.
CodePudding user response:
You can use something like this, based on array and array_remove:
from pyspark.sql import functions as F

# The first column provides the keys of the dict;
# the remaining columns are scanned to compute the values
dict_key = df.columns[0]
entry_cols = df.columns[1:]

result = {
    r[dict_key]: r.dict_entry
    for r in (
        df
        .select(
            dict_key,
            F.array_remove(
                F.array(*[
                    # keep the column name where the value is 'Approve',
                    # otherwise emit the placeholder string 'NULL'
                    F.when(F.col(c) == 'Approve', F.lit(c)).otherwise('NULL')
                    for c in entry_cols
                ]),
                'NULL',  # drop the placeholders from each array
            ).alias('dict_entry')
        )
        .collect()
    )
}
This is the result:
{'1A': ['value-1'], '2B': ['value-1', 'value-2'], '3C': ['value-3']}
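If you are working with a plain pandas DataFrame rather than Spark, the same result can be built with an ordinary dict comprehension. This is a minimal sketch under that assumption; the sample data below is hypothetical, reconstructed from the table in the question:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data matching the question's table
df = pd.DataFrame({
    'ID': ['1A', '2B', '3C'],
    'value-1': ['Approve', 'Approve', np.nan],
    'value-2': [np.nan, 'Approve', np.nan],
    'value-3': [np.nan, np.nan, 'Approve'],
})

# First column supplies the keys; remaining columns supply the values
dict_key = df.columns[0]
entry_cols = df.columns[1:]

# Keep only the column names whose value is 'Approve' for each row
result = {
    row[dict_key]: [c for c in entry_cols if row[c] == 'Approve']
    for _, row in df.iterrows()
}
print(result)
# {'1A': ['value-1'], '2B': ['value-1', 'value-2'], '3C': ['value-3']}
```

Note that NaN comparisons with `== 'Approve'` evaluate to False, so missing values are filtered out automatically.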