How can I remove the double brackets that collect_set produces?
Input data:
import pyspark.sql.functions as F

DF = [('1', '[132]'),
      ('1', '[184, 88]'),
      ('2', '[55]'),
      ('2', '[123,33]')]
DF = spark.sparkContext.parallelize(DF).toDF(['id', 'codes'])
DF.groupBy("id").agg(F.collect_set("codes").alias("codes_concat")).show(4)
+---+------------------+
| id|      codes_concat|
+---+------------------+
|  1|[[184, 88], [132]]|
|  2|  [[123,33], [55]]|
+---+------------------+
How do I get a simple list instead:
+---+--------------+
| id|  codes_concat|
+---+--------------+
|  1|[184, 88, 132]|
|  2|  [123,33, 55]|
+---+--------------+
CodePudding user response:
You can use the translate function to remove the [ and ] characters first, and then aggregate with collect_set (the collected elements are then plain strings, so no inner brackets appear):
DF.groupBy("id").agg(F.collect_set(F.translate("codes", "[]", "")).alias("codes_concat")).show(4)
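As a sanity check outside Spark, the same bracket-stripping can be sketched in plain Python, with str.translate standing in for F.translate (the sample values below are taken from the input data above):

```python
# Plain-Python analogue of F.translate("codes", "[]", ""):
# build a translation table that deletes every '[' and ']'.
strip_brackets = str.maketrans("", "", "[]")

codes = ['[132]', '[184, 88]', '[123,33]', '[55]']
cleaned = [c.translate(strip_brackets) for c in codes]
print(cleaned)  # ['132', '184, 88', '123,33', '55']
```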
CodePudding user response:
Another way
new = (DF.withColumn('codes', F.regexp_replace('codes', r'\[|\]', ''))  # strip the brackets
         .groupBy("id").agg(F.collect_set("codes").alias("codes_concat")))  # group by id
new.show(4)
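The regexp_replace variant does the same stripping with a regular expression. A minimal plain-Python sketch, with re.sub standing in for F.regexp_replace:

```python
import re

# Plain-Python analogue of F.regexp_replace('codes', r'\[|\]', ''):
# replace every '[' or ']' with the empty string.
def strip_brackets(s: str) -> str:
    return re.sub(r"\[|\]", "", s)

print(strip_brackets("[123,33]"))   # 123,33
print(strip_brackets("[184, 88]"))  # 184, 88
```

Note the raw string r"\[|\]": without the r prefix, Python treats \[ as an invalid escape sequence, which is why the regex should be written as a raw string in the Spark version too.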