Home > Enterprise >  PySpark GroupBy agg collect_list multiple columns
PySpark GroupBy agg collect_list multiple columns

Time:02-01

I have working with following code.

df = spark.createDataFrame(data=[["john", "tomato", 1.99, 1],["john", "carrot", 0.45, 1],["bill", "apple", 0.99, 1],["john", "banana", 1.29, 1], ["bill", "taco", 2.59, 1]], schema = ["name", "food", "price", "col_1"])
 ---- ------ ----- ----- 
|name|  food|price|col_1|
 ---- ------ ----- ----- 
|john|tomato| 1.99|    1|
|john|carrot| 0.45|    1|
|bill| apple| 0.99|    1|
|john|banana| 1.29|    1|
|bill|  taco| 2.59|    1|
 ---- ------ ----- ----- 

I am using collect_list with multiple columns with below code.

df.groupBy('name').agg(collect_list(concat_ws(', ','food','price')).alias('sample')).show(10,False)

I getting below output.

 ---- ------------------------------------------ 
|name|sample                                    |
 ---- ------------------------------------------ 
|john|[tomato, 1.99, carrot, 0.45, banana, 1.29]|
|bill|[apple, 0.99, taco, 2.59]                 |
 ---- ------------------------------------------ 

But i need to get the below output.

 ---- ------------------------------------------ 
|name|sample                                    |
 ---- ------------------------------------------ 
|john|tomato 1.99, carrot 0.45, banana 1.29     |
|bill|apple 0.99, taco 2.59                     |
 ---- ------------------------------------------ 

Is there any way to get the above output it will be helpful. I know above code won't get above output.

Could anyone provide the solution which i am expecting.

CodePudding user response:

You can use concat_ws to transform a list into a string using a separator:

df = spark.createDataFrame(
    data=[["john", "tomato", 1.99, 1], ["john", "carrot", 0.45, 1], ["bill", "apple", 0.99, 1],
          ["john", "banana", 1.29, 1], ["bill", "taco", 2.59, 1]], schema=["name", "food", "price", "col_1"])
df = df.groupBy('name').agg(collect_list(concat(col("food"), lit(" "), col("price"))).alias('sample'))\
    .withColumn("sample", concat_ws(",", "sample"))
df.show(truncate=False)

 ---- ----------------------------------- 
|name|sample                             |
 ---- ----------------------------------- 
|john|tomato 1.99,carrot 0.45,banana 1.29|
|bill|apple 0.99,taco 2.59               |
 ---- ----------------------------------- 
  • Related