Get distinct count of values in single row in Pyspark DataFrame


I'm trying to split comma-separated values in a string column into individual values and count each individual value.

The data I have is formatted as such:

+--------------------+
|                tags|
+--------------------+
|cult, horror, got...|
|            violence|
|            romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
|            violence|
|revenge, neo noir...|
+--------------------+

And I want the result to look like

+--------------------+-----+
|                tags|count|
+--------------------+-----+
|cult                |    4|
|horror              |   10|
|goth                |    4|
|violence            |   30|
...

The code I've tried, which hasn't worked, is below:

data.select('tags').groupby('tags').count().show(10)

I also tried the countDistinct function, with something like the following, which also failed to work:
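import pyspark.sql.functions as F

# Sketch of the attempt: countDistinct counts distinct whole
# strings in the column, not the individual comma-separated tags.
data.select(F.countDistinct('tags')).show()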

I feel like I need a function that splits the values on commas and lists them individually, but I'm not sure how to do that.

CodePudding user response:

You can use split() to split the string into an array of tags, then explode() to get one row per tag. Finally, group by and count:

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult,horror"],
    ["cult,comedy"],
    ["romantic,comedy"],
    ["thriler,horror,comedy"],
], schema=["tags"])

# Split the comma-separated string into an array,
# then explode the array into one row per tag.
df = df \
  .withColumn("tags", F.split("tags", pattern=",")) \
  .withColumn("tags", F.explode("tags"))

df = df.groupBy("tags").count()
df.show(truncate=False)

[Out]:
+--------+-----+
|tags    |count|
+--------+-----+
|romantic|1    |
|thriller|1    |
|horror  |2    |
|cult    |2    |
|comedy  |3    |
+--------+-----+
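
Note that in the question's data the tags are separated by a comma and a space, so splitting on "," alone would leave a leading space on every tag after the first. A minimal sketch of one way to handle that, splitting on a regex and trimming (the sample rows here are abbreviated stand-ins for the question's data):

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult, horror, gothic"],
    ["violence"],
    ["romantic, queer"],
], schema=["tags"])

tag_counts = (
    df
    # split on a comma followed by optional whitespace
    .withColumn("tag", F.explode(F.split("tags", r",\s*")))
    # trim any remaining surrounding whitespace
    .withColumn("tag", F.trim("tag"))
    .groupBy("tag")
    .count()
    # most frequent tags first
    .orderBy(F.desc("count"))
)
tag_counts.show(truncate=False)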