I have a table with several repeated values in the name column, as follows:
| name | value |
|------|-------|
| a | 1 |
| a | 2 |
| b | 1 |
| c | 3 |
| c | 4 |
| c | 5 |
I'd like to group it into the following format:
| name | value |
|------|-------|
| a | [1,2] |
| b | [1] |
| c | [3,4,5] |
Can anyone share a concise way to do this gracefully please? Thanks!
CodePudding user response:
Use the collect_list function. It gathers the values of each group into an array.
import pyspark.sql.functions as F
# ... (df is created or loaded here)
df = df.groupBy('name').agg(F.collect_list('value').alias('value'))
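To see what this aggregation computes, here is a minimal plain-Python sketch of the same grouping, run on the rows from the question. It is not Spark code: the helper name `collect_list_by_key` is hypothetical and only models the result of grouping by name and collecting the values on a small in-memory list.

```python
from collections import defaultdict


def collect_list_by_key(rows):
    """Group (name, value) pairs into {name: [values...]}, keeping duplicates.

    Hypothetical helper: a plain-Python model of groupBy + collect_list,
    not part of PySpark.
    """
    grouped = defaultdict(list)
    for name, value in rows:
        # Append every value, including repeats, in input order.
        grouped[name].append(value)
    return dict(grouped)


# The rows from the question's table.
rows = [("a", 1), ("a", 2), ("b", 1), ("c", 3), ("c", 4), ("c", 5)]
result = collect_list_by_key(rows)
print(result)
```

One difference worth keeping in mind: this sketch keeps the input order, whereas in Spark the order of elements inside each collected array is not guaranteed after a shuffle.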
CodePudding user response:
You can also use collect_set instead of collect_list if you want each array to contain only unique values.
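The effect of dropping duplicates can be sketched in plain Python as well. This is only an illustration of the semantics on a small in-memory list, not Spark code; `collect_set_by_key` is a hypothetical helper, and the sample rows (with a repeated pair) are made up for the example.

```python
def collect_set_by_key(rows):
    """Group (name, value) pairs into {name: unique values}.

    Hypothetical helper: a plain-Python model of groupBy + collect_set,
    not part of PySpark.
    """
    grouped = {}
    for name, value in rows:
        # A set silently ignores values that were already seen for this name.
        grouped.setdefault(name, set()).add(value)
    # Sorting is only for readable, stable output in this sketch;
    # Spark does not promise any particular element order.
    return {name: sorted(values) for name, values in grouped.items()}


# Sample rows with a duplicated ("a", 1) pair (assumed data for illustration).
rows_with_duplicates = [("a", 1), ("a", 1), ("a", 2), ("b", 1)]
unique_values = collect_set_by_key(rows_with_duplicates)
print(unique_values)
```

On the table from the question there are no repeated (name, value) pairs, so both functions would give arrays with the same elements there.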