Get distinct count of values in single row in Pyspark DataFrame


I'm trying to split comma-separated values in a string column into individual values and count each individual value.

The data I have is formatted as such:

+--------------------+
|                tags|
+--------------------+
|cult, horror, got...|
|            violence|
|            romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
|            violence|
|revenge, neo noir...|
+--------------------+

And I want the result to look like

+--------------------+-----+
|                tags|count|
+--------------------+-----+
|cult                |    4|
|horror              |   10|
|goth                |    4|
|violence            |   30|
...

The code I've tried, which hasn't worked, is below:

data.select('tags').groupby('tags').count().show(10)

I also tried the countDistinct function, with something like the following, which also failed to work:
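import pyspark.sql.functions as F

# Sketch of the attempt: countDistinct counts distinct whole
# strings in the column, not the individual comma-separated tags.
data.select(F.countDistinct('tags')).show()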

I feel like I need a function that splits the values on commas and lists them individually, but I'm not sure how to do that.

CodePudding user response:

You can use split() to split the string into an array of tags, then explode() to get one row per tag. Finally, group by and count:

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult,horror"],
    ["cult,comedy"],
    ["romantic,comedy"],
    ["thriler,horror,comedy"],
], schema=["tags"])

# Split the comma-separated string into an array,
# then explode the array into one row per tag.
df = df \
  .withColumn("tags", F.split("tags", pattern=",")) \
  .withColumn("tags", F.explode("tags"))

df = df.groupBy("tags").count()
df.show(truncate=False)

[Out]:
+--------+-----+
|tags    |count|
+--------+-----+
|romantic|1    |
|thriller|1    |
|horror  |2    |
|cult    |2    |
|comedy  |3    |
+--------+-----+
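
Note that in the question's data the tags are separated by a comma and a space, so splitting on "," alone would leave a leading space on every tag after the first. A minimal sketch of one way to handle that, splitting on a regex and trimming (the sample rows here are abbreviated stand-ins for the question's data):

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult, horror, gothic"],
    ["violence"],
    ["romantic, queer"],
], schema=["tags"])

tag_counts = (
    df
    # split on a comma followed by optional whitespace
    .withColumn("tag", F.explode(F.split("tags", r",\s*")))
    # trim any remaining surrounding whitespace
    .withColumn("tag", F.trim("tag"))
    .groupBy("tag")
    .count()
    # most frequent tags first
    .orderBy(F.desc("count"))
)
tag_counts.show(truncate=False)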