Home > Software engineering >  Spark DF column to string JSON
Spark DF column to string JSON

Time:05-26

I have a DF like this:

 ------------ ------------------------------------------------------------- 
|pk_attr_name|pk_struct                                                    |
 ------------ ------------------------------------------------------------- 
|CLNT_GRP_CD |{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"} |
|IDI_CONTRACT|{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}|
 ------------ ------------------------------------------------------------- 

I want to define a JSON string from pk_struct colum. Desired output:

pk_struct_str = '[{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"},{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}]'

I tried:

pk_df.select(F.to_json(F.struct("pk_struct")).alias("json")).show(truncate=False)

but did not give me the desired result

pk_df.printSchema()
root
 |-- pk_attr_name: string (nullable = true)
 |-- pk_struct: string (nullable = true)

CodePudding user response:

You can achieve this result using collect_list or collect_set function.But it can be used along with aggregate function. So created dummy column and grouped by that column values and in aggregation used collect_list function

df.show(2,False)
df1 = df.withColumn("dummy",lit("XXX"))
df2 = df1.groupBy("dummy").agg(collect_list(df1.pk_struct))
df2.show(2,False)

    
 ------------ ------------------------------------------------------------- 
|pk_attr_name|pk_struct                                                    |
 ------------ ------------------------------------------------------------- 
|CLNT_GRP_CD |{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"} |
|IDI_CONTRACT|{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}|
 ------------ ------------------------------------------------------------- 

 ----- ----------------------------------------------------------------------------------------------------------------------------- 
|dummy|collect_list(pk_struct)                                                                                                      |
 ----- ----------------------------------------------------------------------------------------------------------------------------- 
|XXX  |[{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"}, {"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}]|
 ----- ----------------------------------------------------------------------------------------------------------------------------- 
  • Related