I have the below dataframe.
Col1 Col2
AA_1 S1
ABC S2
BCD S3
BCD S5
PQ_2 S6
XYP S8
XYP S9
I need the output in the below format.
data = {'AA_1': '[S1]', 'ABC': '[S2]', 'BCD': '[S3,S5]', 'PQ_2': '[S6]', 'XYP': '[S8,S9]'}
Is there any way to achieve the above output using only PySpark? That would be really helpful.
CodePudding user response:
This can be implemented by grouping by col1 and using the aggregate function collect_list to collect the col2 values into a list.
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

# Create a session if one is not already available (e.g. outside the PySpark shell)
spark = SparkSession.builder.getOrCreate()
data = [
('AA_1', 'S1'),
('ABC', 'S2'),
('BCD', 'S3'),
('BCD', 'S5'),
('PQ_2', 'S6'),
('XYP', 'S8'),
('XYP', 'S9')
]
df = spark.createDataFrame(data, ["col1", "col2"])
data2 = df.groupBy('col1').agg(collect_list('col2').alias('values')).collect()
data3 = {}
for row in data2:
    data3[row.col1] = row.values
print(data3)
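Note that this produces a dict whose values are Python lists (e.g. `{'BCD': ['S3', 'S5']}`), whereas the question shows bracketed strings like `'[S3,S5]'`. If that exact string format is needed, one option is a small post-processing step over the collected result. A minimal sketch, using a hard-coded sample dict standing in for the collected Spark output:

```python
# Sample standing in for the dict collected from Spark above.
collected = {'AA_1': ['S1'], 'ABC': ['S2'], 'BCD': ['S3', 'S5'],
             'PQ_2': ['S6'], 'XYP': ['S8', 'S9']}

# Join each list into a comma-separated string wrapped in square brackets.
formatted = {key: '[' + ','.join(values) + ']' for key, values in collected.items()}
print(formatted)
# {'AA_1': '[S1]', 'ABC': '[S2]', 'BCD': '[S3,S5]', 'PQ_2': '[S6]', 'XYP': '[S8,S9]'}
```

Alternatively, the same formatting could be done on the Spark side with `concat_ws` before collecting, which avoids the extra Python loop for large results.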