I have the dataframe below, and I need to get a single row with all the marks fields concatenated with a delimiter such as a pipe.
So: PACKAGING MARKS 3|PACKAGING MARKS 2|PACKAG.....
There can be a varying number of marks records for each mid.
mid | marksId | id | index | marks |
---|---|---|---|---|
2 | 3 | 3 | 2 | PACKAGING MARKS 3 |
2 | 3 | 3 | 1 | PACKAGING MARKS 2 |
2 | 3 | 3 | 0 | PACKAGING MARKS 1 |
2 | 4 | 4 | 2 | PACKAGING MARKS 23 |
2 | 4 | 4 | 1 | PACKAGING MARKS 22 |
2 | 4 | 4 | 0 | PACKAGING MARKS 21 |
Thanks
CodePudding user response:
Assuming you want one delimited string for each "mid", you can collect all the "marks" values with collect_list() and use concat_ws() to create the string:
import pyspark.sql.functions as F
df.groupby('mid').agg(F.concat_ws('|', F.collect_list('marks')).alias('marks_str')).show(truncate=False)