How to Convert Pyspark Dataframe to Dictionary in Python


I have the below DataFrame.

Col1         Col2
AA_1          S1
ABC           S2
BCD           S3
BCD           S5
PQ_2          S6
XYP           S8
XYP           S9

I need output in the below format.

data = {'AA_1': '[S1]', 'ABC': '[S2]', 'BCD': '[S3,S5]', 'PQ_2': '[S6]', 'XYP': '[S8,S9]'}

Is there any way to achieve the above output using only PySpark? That would be really helpful.

CodePudding user response:

This can be implemented by grouping by `col1` and aggregating with `collect_list` to gather the matching `col2` values into a list per group.

    from pyspark.sql.functions import collect_list

    data = [
        ('AA_1', 'S1'),
        ('ABC',  'S2'),
        ('BCD',  'S3'),
        ('BCD',  'S5'),
        ('PQ_2', 'S6'),
        ('XYP',  'S8'),
        ('XYP',  'S9'),
    ]

    df = spark.createDataFrame(data, ["col1", "col2"])

    # Group by col1 and collect all col2 values for each group into a list.
    data2 = df.groupBy('col1').agg(collect_list('col2').alias('values')).collect()

    # Build a plain Python dict from the collected rows.
    data3 = {row.col1: row.values for row in data2}

    print(data3)
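Note that this yields Python lists as the dictionary values (e.g. `['S3', 'S5']`), while the question's expected output has them as bracketed strings (e.g. `'[S3,S5]'`). If that exact format matters, a small post-processing step can join each list into a string; a minimal sketch, assuming `data3` has already been built as above:

    # Sketch: convert collected lists into the bracketed-string format
    # shown in the question. data3 here stands in for the dict built above.
    data3 = {'AA_1': ['S1'], 'ABC': ['S2'], 'BCD': ['S3', 'S5'],
             'PQ_2': ['S6'], 'XYP': ['S8', 'S9']}

    # Join each list of values with commas and wrap it in brackets.
    data = {key: '[' + ','.join(values) + ']' for key, values in data3.items()}

    print(data)
    # {'AA_1': '[S1]', 'ABC': '[S2]', 'BCD': '[S3,S5]', 'PQ_2': '[S6]', 'XYP': '[S8,S9]'}

Alternatively, the same joining could be done on the Spark side with `concat_ws` before collecting, but the pure-Python step above keeps the Spark query unchanged.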