How to make dictionary from two pyspark columns keys and values


I'd like to build a dictionary from two columns of a DataFrame. The values in the first column should become the dictionary's keys, and the values in the second column should become the list of values for each key.

Example:

keys vals
203 4
203 3
203 6
412 33
412 123

I want to transform such a DataFrame into:

final_dict = {
   "203": [4, 3, 6],
   "412": [33, 123]
}

Is there a fast way to do this without loops, or are they necessary here?
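
For reference, a minimal sketch of how the sample DataFrame above could be built (assuming a running SparkSession and the column names keys and vals from the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the example data with the two columns "keys" and "vals"
df = spark.createDataFrame(
    [(203, 4), (203, 3), (203, 6), (412, 33), (412, 123)],
    ['keys', 'vals'],
)
df.show()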

CodePudding user response:

One way to do it is to use the collect_list function to gather all the values in each group (use collect_set if you want distinct values instead):

import pyspark.sql.functions as F

# Group by key, aggregate each group's values into a list,
# then bring the resulting rows back to the driver
lst = df.groupby('keys').agg(F.collect_list('vals').alias('vals')).collect()

# Each row is (key, list_of_vals); build the dictionary with string keys
print({str(i[0]): i[1] for i in lst})
# {'412': [33, 123], '203': [4, 6, 3]}

Note that .collect() brings all the results to the driver, so it can take time (and memory) on a large DataFrame.
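
If you'd rather skip the Python dict comprehension, a variant (a sketch, assuming the same df) maps the grouped rows to (key, values) pairs and collects them with collectAsMap; the final step still happens entirely on the driver:

import pyspark.sql.functions as F

grouped = df.groupby('keys').agg(F.collect_list('vals').alias('vals'))

# Turn each row into a (str(key), list_of_vals) pair and collect as a dict
final_dict = grouped.rdd.map(lambda r: (str(r['keys']), r['vals'])).collectAsMap()
print(final_dict)
# e.g. {'203': [4, 3, 6], '412': [33, 123]} (order within each list may vary)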
