PySpark remove double brackets after collect_set of list

Time:10-06

How can I remove the double brackets that collect_set produces?

Input data :

DF = [('1',  '[132]'),
      ('1',  '[184, 88]'),
      ('2',  '[55]'),
      ('2',  '[123,33]'),]

DF = spark.sparkContext.parallelize(DF).toDF(['id', 'codes'])

DF.groupBy("id").agg(F.collect_set("codes").alias("codes_concat")).show(4)
+---+------------------+
| id|      codes_concat|
+---+------------------+
|  1|[[184, 88], [132]]|
|  2|  [[123,33], [55]]|
+---+------------------+

How do I get a simple list instead:

+---+------------------+
| id|      codes_concat|
+---+------------------+
|  1|    [184, 88, 132]|
|  2|      [123,33, 55]|
+---+------------------+

CodePudding user response:

You can use the translate function to strip the [ and ] characters from each string first, and then aggregate with collect_set:

from pyspark.sql import functions as F

DF.groupBy("id").agg(F.collect_set(F.translate("codes", "[]", "")).alias("codes_concat")).show(4)

CodePudding user response:

Another way

new = (DF.withColumn('codes', F.regexp_replace('codes', r'\[|\]', ''))  # strip the brackets
       .groupBy("id").agg(F.collect_set("codes").alias("codes_concat")))  # group and collect
new.show(4)