I'd like to build a dictionary from two columns of a DataFrame: the first column's values should become the dictionary's keys, and the second column's values should be collected into a list for each key.
Example:
| keys | vals |
|------|------|
| 203  | 4    |
| 203  | 3    |
| 203  | 6    |
| 412  | 33   |
| 412  | 123  |
I want to transform such a DataFrame into:
final_dict = {
    "203": [4, 3, 6],
    "412": [33, 123]
}
Is there a fast method that avoids loops, or are loops necessary here?
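For reference, the example DataFrame can be recreated like this (a minimal sketch, assuming PySpark with a local SparkSession; the setup is illustrative and should be adjusted to your environment):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession (hypothetical; adapt as needed)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Recreate the example table with the two columns "keys" and "vals"
df = spark.createDataFrame(
    [(203, 4), (203, 3), (203, 6), (412, 33), (412, 123)],
    ["keys", "vals"],
)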
CodePudding user response:
One way to do it is to use the function collect_list to get all the values from a group (use collect_set if you want distinct values instead):
import pyspark.sql.functions as F

# Group by the key column, collect each group's values into a list,
# then bring the aggregated rows (one per key) back to the driver
lst = df.groupby('keys').agg(F.collect_list('vals').alias('vals')).collect()

# Each Row holds a key and its list of values; build the dict with string keys
print({str(row['keys']): row['vals'] for row in lst})
# {'412': [33, 123], '203': [4, 6, 3]}
Note that the .collect() call brings the aggregated rows back to the driver, so it could take time (and memory) if the DataFrame has a large number of distinct keys.
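As mentioned above, if you want distinct values per key, the same pattern works with collect_set (a sketch; note that the element order within each list is not guaranteed):

# collect_set drops duplicate values within each group
lst = df.groupby('keys').agg(F.collect_set('vals').alias('vals')).collect()
print({str(row['keys']): row['vals'] for row in lst})
# e.g. {'412': [123, 33], '203': [3, 4, 6]}  (order may vary)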