I have a Spark dataframe as below. I want to create a column 'new_col' which groups by all of the columns except 'Code' and 'Department' and assign a JSON structure based on columns 'Code' and 'Department'.
The dataframe needs to be sorted first. Rows 1-3 and 4-5 are duplicated except for columns Code and Department. So I would have JSON created for 1st 3 rows as new_col as {"Code": "A", "Department": "Department Store"}, { "Code": "B","Department": "All Other Suppliers"}, {"Code": "C","Department": "Rest"}
My input dataframe:
Expected output Spark dataframe:
CodePudding user response:
Something like this should do it:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('XYZ', '324 NW', 'VA', 'A', 'Department Store', 'X', 'Y'),
('XYZ', '324 NW', 'VA', 'B', 'All Other Suppliers', 'X', 'Y'),
('XYZ', '324 NW', 'VA', 'C', 'Rest', 'X', 'Y'),
('ABC', '45 N Ave', 'MA', 'C', 'Rest', 'A', 'A'),
('ABC', '45 N Ave', 'MA', 'B', 'All Other Suppliers', 'A', 'A'),
('ZXC', '12 SW Street', 'NY', 'A', 'Department Store', 'B', 'Z')],
['Name', 'Address', 'State', 'Code', 'Department', 'col1', 'col2']
)
cols = [c for c in df.columns if c not in ['Code', 'Department']]
w1 = W.partitionBy(cols).orderBy('Code')
w2 = W.partitionBy(cols).orderBy(F.desc('Code'))
df = (df
.withColumn('_rn', F.row_number().over(w1))
.withColumn('new_col', F.collect_list(F.to_json(F.struct(['Code', 'Department']))).over(w2))
.withColumn("new_col", F.array_join("new_col", ","))
.filter('_rn=1')
.drop('_rn')
)
df.show(truncate=False)
# ---- ------------ ----- ---- ------------------- ---- ---- -----------------------------------------------------------------------------------------------------------------------------
# |Name|Address |State|Code|Department |col1|col2|new_col |
# ---- ------------ ----- ---- ------------------- ---- ---- -----------------------------------------------------------------------------------------------------------------------------
# |ABC |45 N Ave |MA |B |All Other Suppliers|A |A |{"Code":"C","Department":"Rest"},{"Code":"B","Department":"All Other Suppliers"} |
# |XYZ |324 NW |VA |A |Department Store |X |Y |{"Code":"C","Department":"Rest"},{"Code":"B","Department":"All Other Suppliers"},{"Code":"A","Department":"Department Store"}|
# |ZXC |12 SW Street|NY |A |Department Store |B |Z |{"Code":"A","Department":"Department Store"} |
# ---- ------------ ----- ---- ------------------- ---- ---- -----------------------------------------------------------------------------------------------------------------------------