How to add a column to pyspark df, the data format should be a list, and come from grouped data from

Time:11-18

I'm new to PySpark and not sure if there's an easy way to do this.

I have a df with people's interests, for example:

name interest
A gym
A food
A games
B games

From this df, I would like to create a new one like the following:

name interests
A gym;food;games
B games

Can someone help with this? Sorry in advance if I didn't explain the question clearly enough.

CodePudding user response:

You can use concat_ws and collect_list from pyspark.sql.functions:

from pyspark.sql import functions as F

df.groupBy("name").agg(
    F.concat_ws(";", F.collect_list("interest")).alias("interest")
).show(truncate=False)

prints:

+----+--------------+
|name|interest      |
+----+--------------+
|A   |gym;food;games|
|B   |games         |
+----+--------------+

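Remember to assign the result to a new DataFrame if you want to keep it. show() only prints the rows and returns None, so do the assignment before calling show().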

  • concat_ws: Concatenates multiple input string columns together into a single string column, using the given separator.
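  • collect_list: Aggregate function that returns a list of the values in each group, keeping duplicates. The order of the collected values is not guaranteed after a shuffle.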

CodePudding user response:

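# X is an existing Spark DataFrame; spark is the active SparkSession
schema = X.schema                                # keep the original schema
X_pd = X.toPandas()                              # collect all rows to the driver as a pandas DataFrame
_X = spark.createDataFrame(X_pd, schema=schema)  # rebuild a Spark DataFrame with the same schema
del X_pd                                         # drop the pandas copy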