Home > other >  How to convert PySpark dataframe columns into list of dictionary based on groupBy column
How to convert PySpark dataframe columns into list of dictionary based on groupBy column

Time:10-05

I'm converting dataframe columns into list of dictionary.

Input dataframe has 3 columns:

ID  accounts pdct_code
1    100      IN
1    200      CC
2    300      DD
2    400      ZZ
3    500      AA

I need to read this input dataframe and convert it into 3 output rows. The output should look like this:

ID arrayDict
1  [{“accounts”: 100, “pdct_cd”: ’IN’}, {”accounts”: 200, “pdct_cd”: ’CC’}]

Similarly, for ID "2" there should be 1 row with 2 dictionaries with key value pair.

I tried this:

Df1 = df.groupBy("ID").agg(collect_list(struct(col("accounts"), ("pdct_cd"))).alias("array_dict"))

But output is not quite as I wanted which should be a list of dictionary.

CodePudding user response:

What you described (list of dictionary) doesn't exist in Spark. Instead of lists we have arrays, instead of dictionaries we have structs or maps. Since you didn't operate these terms, this will be a loose interpretation of what I think you need.

The following will create arrays of strings. Those strings will have the structure which you probably want.

df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code")))

struct() puts your column inside a struct data type.
to_json() creates a JSON string out of the provided struct.
collect_list() is an aggregation function which moves all the strings of the group into an array.

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 100, "IN"),
     (1, 200, "CC"),
     (2, 300, "DD"),
     (2, 400, "ZZ"),
     (3, 500, "AA")],
    ["ID", "accounts", "pdct_code"])

df = df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code"))).alias("array_dict"))

df.show(truncate=0)
#  --- ---------------------------------------------------------------------- 
# |ID |array_dict                                                            |
#  --- ---------------------------------------------------------------------- 
# |1  |[{"accounts":100,"pdct_code":"IN"}, {"accounts":200,"pdct_code":"CC"}]|
# |3  |[{"accounts":500,"pdct_code":"AA"}]                                   |
# |2  |[{"accounts":300,"pdct_code":"DD"}, {"accounts":400,"pdct_code":"ZZ"}]|
#  --- ---------------------------------------------------------------------- 
  • Related