Pyspark combine values with same condition into new column as a list

Time:04-29

I have a table with several repeated values, as follows:

| name | value |
|------|-------|
| a    | 1     |
| a    | 2     |
| b    | 1     |
| c    | 3     |
| c    | 4     |
| c    | 5     |

I'd like to group this into the following format:

| name | value     |
|------|-----------|
| a    | [1, 2]    |
| b    | [1]       |
| c    | [3, 4, 5] |

Can anyone share a concise way to do this gracefully please? Thanks!

CodePudding user response:

Use the collect_list function inside a groupBy aggregation.

import pyspark.sql.functions as F

......
df = df.groupBy('name').agg(F.collect_list('value').alias('value'))

CodePudding user response:

If you only want the unique values in each array, consider using collect_set instead of collect_list.
