I'm trying to split comma-separated values in a string column into individual values and count each one.
The data I have is formatted as such:
+--------------------+
|                tags|
+--------------------+
|cult, horror, got...|
|            violence|
|            romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
|            violence|
|revenge, neo noir...|
+--------------------+
And I want the result to look like:
+--------------------+-----+
|                tags|count|
+--------------------+-----+
|cult                |    4|
|horror              |   10|
|goth                |    4|
|violence            |   30|
...
The code I've tried that hasn't worked is below:
data.select('tags').groupby('tags').count().show(10)
I also tried countDistinct, which didn't work either.
I feel like I need a function that splits each string on the commas and turns the values into separate rows, but I'm not sure how to do that.
CodePudding user response:
You can use split() to turn the comma-separated string into an array, explode() to get one row per tag, then groupBy and count:
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult,horror"],
    ["cult,comedy"],
    ["romantic,comedy"],
    ["thriller,horror,comedy"],
], schema=["tags"])

df = (
    df
    .withColumn("tags", F.split("tags", pattern=","))  # string -> array of tags
    .withColumn("tags", F.explode("tags"))             # one row per tag
)

df = df.groupBy("tags").count()
df.show(truncate=False)
[Out]:
+--------+-----+
|tags    |count|
+--------+-----+
|romantic|1    |
|thriller|1    |
|horror  |2    |
|cult    |2    |
|comedy  |3    |
+--------+-----+
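
Note that in the question's data the tags are separated by a comma followed by a space, so splitting on "," alone would leave a leading space on every tag after the first. Since split() takes a regex pattern, one option is to split on ,\s* so the whitespace is consumed too. A minimal sketch under that assumption (the sample rows are placeholders, and a running SparkSession named spark is assumed, as in the answer above):

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult, horror, gothic"],
    ["violence"],
    ["romantic, queer"],
], schema=["tags"])

counts = (
    df
    # the regex ,\s* consumes the comma and any spaces that follow it
    .withColumn("tags", F.explode(F.split("tags", r",\s*")))
    .groupBy("tags")
    .count()
    .orderBy(F.desc("count"))  # optional: most frequent tags first
)
counts.show(truncate=False)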