Home > Back-end >  Creating JSON string using two PySpark columns by GroupBy
Creating JSON string using two PySpark columns by GroupBy

Time:04-26

I have a Spark dataframe as below. I want to create a column 'new_col' which groups by all of the columns except 'Code' and 'Department' and assign a JSON structure based on columns 'Code' and 'Department'.

The dataframe needs to be sorted first. Rows 1-3 and 4-5 are duplicated except for columns Code and Department. So I would have JSON created for 1st 3 rows as new_col as {"Code": "A", "Department": "Department Store"}, { "Code": "B","Department": "All Other Suppliers"}, {"Code": "C","Department": "Rest"}

My input dataframe:

enter image description here

Expected output Spark dataframe:

enter image description here

CodePudding user response:

Something like this should do it:

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('XYZ', '324 NW', 'VA', 'A', 'Department Store', 'X', 'Y'),
     ('XYZ', '324 NW', 'VA', 'B', 'All Other Suppliers', 'X', 'Y'),
     ('XYZ', '324 NW', 'VA', 'C', 'Rest', 'X', 'Y'),
     ('ABC', '45 N Ave', 'MA', 'C', 'Rest', 'A', 'A'),
     ('ABC', '45 N Ave', 'MA', 'B', 'All Other Suppliers', 'A', 'A'),
     ('ZXC', '12 SW Street', 'NY', 'A', 'Department Store', 'B', 'Z')],
    ['Name', 'Address', 'State', 'Code', 'Department', 'col1', 'col2']
)

cols = [c for c in df.columns if c not in ['Code', 'Department']]
w1 = W.partitionBy(cols).orderBy('Code')
w2 = W.partitionBy(cols).orderBy(F.desc('Code'))
df = (df
      .withColumn('_rn', F.row_number().over(w1))
      .withColumn('new_col', F.collect_list(F.to_json(F.struct(['Code', 'Department']))).over(w2))
      .withColumn("new_col", F.array_join("new_col", ","))
      .filter('_rn=1')
      .drop('_rn')
)
df.show(truncate=False)
#  ---- ------------ ----- ---- ------------------- ---- ---- ----------------------------------------------------------------------------------------------------------------------------- 
# |Name|Address     |State|Code|Department         |col1|col2|new_col                                                                                                                      |
#  ---- ------------ ----- ---- ------------------- ---- ---- ----------------------------------------------------------------------------------------------------------------------------- 
# |ABC |45 N Ave    |MA   |B   |All Other Suppliers|A   |A   |{"Code":"C","Department":"Rest"},{"Code":"B","Department":"All Other Suppliers"}                                             |
# |XYZ |324 NW      |VA   |A   |Department Store   |X   |Y   |{"Code":"C","Department":"Rest"},{"Code":"B","Department":"All Other Suppliers"},{"Code":"A","Department":"Department Store"}|
# |ZXC |12 SW Street|NY   |A   |Department Store   |B   |Z   |{"Code":"A","Department":"Department Store"}                                                                                 |
#  ---- ------------ ----- ---- ------------------- ---- ---- ----------------------------------------------------------------------------------------------------------------------------- 
  • Related