I'm new to PySpark and not sure if there's an easy way to do this.
I have a DataFrame with people's interests, for example:
name | interest |
---|---|
A | gym |
A | food |
A | games |
B | games |
From this DataFrame, I would like to create a new one like the following:
name | interests |
---|---|
A | gym;food;games |
B | games |
Can someone help with this? Sorry in advance if I didn't explain the question clearly enough.
CodePudding user response:
You can use concat_ws and collect_list from pyspark.sql.functions:
from pyspark.sql import functions as F

df.groupBy("name").agg(
    F.concat_ws(";", F.collect_list("interest")).alias("interest")
).show(truncate=False)
prints:
+----+--------------+
|name|interest      |
+----+--------------+
|A   |gym;food;games|
|B   |games         |
+----+--------------+
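Remember to assign the result to a new DataFrame if you want to keep it: groupBy and agg return a new DataFrame rather than modifying df. Leave .show() off the assignment, since show() only prints and returns None.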
concat_ws: Concatenates multiple input string columns together into a single string column, using the given separator.
collect_list: Aggregate function that returns a list of objects, keeping duplicates.
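For intuition, here is a rough plain-Python equivalent of what the two functions do together. It is not Spark code, and `join_by_key` is a hypothetical helper written only for this illustration:

```python
from collections import defaultdict

# Same sample rows as in the question: (name, interest).
rows = [("A", "gym"), ("A", "food"), ("A", "games"), ("B", "games")]


def join_by_key(rows, sep=";"):
    """Group values by key, then join each group with a separator."""
    grouped = defaultdict(list)
    for name, interest in rows:
        # Like collect_list: append every value, duplicates included.
        grouped[name].append(interest)
    # Like concat_ws: join each collected list into one string.
    return {name: sep.join(values) for name, values in grouped.items()}


result = join_by_key(rows)
print(result)  # {'A': 'gym;food;games', 'B': 'games'}
```

Unlike this sketch, Spark runs the grouping across partitions, which is why the order inside each joined string is not guaranteed there.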
CodePudding user response:
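# Round-trip the Spark DataFrame X through pandas, keeping its original schema.
schema = X.schema
X_pd = X.toPandas()  # collects all rows to the driver
_X = spark.createDataFrame(X_pd, schema=schema)
del X_pd  # drop the reference to the intermediate pandas copy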