I have data in the following format:
|cust_id |card_num |balance|payment |due |card_type|
|:-------|:--------|:------|:-------|:----|:------- |
|c1 |1234 |567 |344 |33 |A |
|c1 |2345 |57 |44 |3 |B |
|c2 |123 |561 |34 |39 |A |
|c3 |345 |517 |914 |23 |C |
|c3 |127 |56 |34 |32 |B |
|c3 |347 |67 |344 |332 |B |
I want it converted into the following format, where each column is aggregated into an ArrayType:
|cust_id|card_num |balance |payment |due | card_type|
|:------|:-------- |:------ |:------- |:---- |:---- |
|c1 |[1234,2345] |[567,57] |[344,44] |[33,3] |[A,B] |
|c2 |[123] |[561] |[34] |[39] |[A] |
|c3 |[345,127,347]|[517,56,67]|[914,34,344]|[23,32,332]|[C,B,B] |
How can I write generic code in PySpark to do this transformation and save the result in CSV format?
CodePudding user response:
You just need to group by the cust_id column and use the collect_list function to aggregate each remaining column into an array:
from pyspark.sql.functions import collect_list

df = ...  # input DataFrame

result = df.groupBy("cust_id").agg(
    collect_list("card_num").alias("card_num"),
    collect_list("balance").alias("balance"),
    collect_list("payment").alias("payment"),
    collect_list("due").alias("due"),
    collect_list("card_type").alias("card_type"),
)