How to add a column to a PySpark DataFrame whose values are a list built from grouped data

I'm new to PySpark and not sure whether there's an easy way to do this.

I have a df with people's interests, for example:

name	interest
A	gym
A	food
A	games
B	games

From this df, I would like to create a new one like the following:

name	interests
A	gym;food;games
B	games

Can someone help with this? Sorry in advance if I didn't explain the question clearly enough.

CodePudding user response：

You can use concat_ws and collect_list from pyspark.sql.functions:

from pyspark.sql import functions as F

df.groupBy("name").agg(
    F.concat_ws(";", F.collect_list("interest")).alias("interest")
).show(truncate=False)

prints:

+----+--------------+
|name|interest      |
+----+--------------+
|A   |gym;food;games|
|B   |games         |
+----+--------------+

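Remember to assign the result to a new DataFrame if you want to keep it. Drop the .show() call when you do, because .show() only prints and returns None.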

concat_ws: Concatenates multiple input string columns together into a single string column, using the given separator.
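collect_list: Aggregate function that returns a list of the values in each group, keeping duplicates. The order of the collected values is not guaranteed after a shuffle.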
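To make the two steps concrete, here is a plain-Python sketch of what the aggregation computes, without Spark. The `rows` sample and the `group_and_join` helper are hypothetical names made up for illustration. They only model the semantics of groupBy + collect_list + concat_ws and are not part of the PySpark API.

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the question's table: (name, interest).
rows = [("A", "gym"), ("A", "food"), ("A", "games"), ("B", "games")]


def group_and_join(rows, sep=";"):
    """Plain-Python model of groupBy("name") + collect_list + concat_ws."""
    grouped = defaultdict(list)
    for name, interest in rows:
        # collect_list step: gather every value per key, duplicates included.
        grouped[name].append(interest)
    # concat_ws step: join each list into one separator-delimited string.
    return {name: sep.join(values) for name, values in grouped.items()}


result = group_and_join(rows)
print(result)  # {'A': 'gym;food;games', 'B': 'games'}
```

The intermediate `grouped[name]` list corresponds to what collect_list alone would produce per group. This sketch keeps the input order, whereas Spark does not promise any particular order inside the collected list.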

CodePudding user response：

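# Round-trip the Spark DataFrame X through pandas, keeping its original schema
schema = X.schema
X_pd = X.toPandas()  # collects all rows onto the driver
_X = spark.createDataFrame(X_pd, schema=schema)
del X_pd  # drop the reference to the pandas copy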