I have the below dataframe.
Col1 Col2
AA_1 S1
ABC S2
BCD S3
BCD S5
PQ_2 S6
XYP S8
XYP S9
I need the output in the below format.
data = {'AA_1': '[S1]', 'ABC': '[S2]', 'BCD': '[S3,S5]', 'PQ_2': '[S6]', 'XYP': '[S8,S9]'}
Is there any way to achieve the above output using only PySpark? That would be really helpful.
CodePudding user response:
This can be implemented by grouping by col1 and using the aggregate function collect_list to collect the col2 values into a list.
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

# Create a session if one is not already available (e.g. outside the PySpark shell)
spark = SparkSession.builder.getOrCreate()
data = [
('AA_1', 'S1'),
('ABC', 'S2'),
('BCD', 'S3'),
('BCD', 'S5'),
('PQ_2', 'S6'),
('XYP', 'S8'),
('XYP', 'S9')
]
df = spark.createDataFrame(data, ["col1", "col2"])
data2 = df.groupBy('col1').agg(collect_list('col2').alias('values')).collect()
data3 = {}
for row in data2:
    data3[row.col1] = row.values
print(data3)
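Note that this produces a dict whose values are Python lists (e.g. `{'BCD': ['S3', 'S5']}`), whereas the question shows bracketed strings like `'[S3,S5]'`. If that exact string format is needed, one option is a small post-processing step over the collected result. A minimal sketch, using a hard-coded sample dict standing in for the collected Spark output:

```python
# Sample standing in for the dict collected from Spark above.
collected = {'AA_1': ['S1'], 'ABC': ['S2'], 'BCD': ['S3', 'S5'],
             'PQ_2': ['S6'], 'XYP': ['S8', 'S9']}

# Join each list into a comma-separated string wrapped in square brackets.
formatted = {key: '[' + ','.join(values) + ']' for key, values in collected.items()}
print(formatted)
# {'AA_1': '[S1]', 'ABC': '[S2]', 'BCD': '[S3,S5]', 'PQ_2': '[S6]', 'XYP': '[S8,S9]'}
```

Alternatively, the same formatting could be done on the Spark side with `concat_ws` before collecting, which avoids the extra Python loop for large results.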