Count occurrences of list of values in column using PySpark DataFrame


I have a PySpark DataFrame with a string column text and a separate list word_list, and I need to count the total number of times the word_list values occur in each text row (a value can occur, and be counted, more than once).

df = spark.createDataFrame(
    [(1, 'Hello my name is John'),
     (2, 'Yo go Bengals'),
     (3, 'this is a text')],
    ['id', 'text']
)

word_list = ['is', 'm', 'o', 'my']

The result would be:

| text                  | list_count |
|-----------------------|------------|
| Hello my name is John | 6          |
| Yo go Bengals         | 2          |
| this is a text        | 2          |

For the first value of text, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once, for a total of 6. In the second row, the only word_list value that appears is "o", and it appears twice. In the third row, the only word_list value that appears is "is", and it appears twice.

The result doesn't necessarily have to be PySpark-based either; it could be in Pandas if that's easier.

CodePudding user response:

You can do this with a UDF as below

UDF

import re
from functools import partial

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame(
    [(1, 'Hello my name is John'),
     (2, 'Yo go Bengals'),
     (3, 'this is a text')],
    ['id', 'text']
)

word_list = ['is', 'm', 'o', 'my']

def count_values(inp, map_list=None):
    # Sum the (non-overlapping) matches of every pattern in the input string
    count = 0
    for pattern in map_list:
        count += len(re.findall(pattern, inp))
    return count

count_values_udf = F.udf(partial(count_values, map_list=word_list), IntegerType())

df.select(
    count_values_udf(F.col('text')).alias('count_values')
).show()

+------------+
|count_values|
+------------+
|           6|
|           2|
|           2|
+------------+
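
One caveat: count_values passes each word to re.findall as a regular expression pattern, so a word containing regex metacharacters (e.g. '.' or '+') would not be matched literally. If that can happen, escaping the words first is safer; a minimal sketch (safe_word_list is an illustrative name, not part of the original answer):

import re

# Escape regex metacharacters so each word is matched as a literal substring
safe_word_list = [re.escape(w) for w in word_list]
count_values_udf = F.udf(partial(count_values, map_list=safe_word_list), IntegerType())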

CodePudding user response:

To count the number of occurrences of a substring in a string column, you can split that column by the substring; the count is the size of the resulting array minus 1. For example, splitting "this is a text" on "is" gives ["th", " ", " a text"], an array of size 3, so "is" occurs twice.

So in your case, you can use the aggregate function on the word_list array column and, for each element, split the text column and add its size minus 1:

from pyspark.sql import functions as F

result = df.withColumn(
    "word_list",
    F.array(*[F.lit(x) for x in word_list])
).withColumn(
    "list_count",
    F.expr("aggregate(word_list, 0, (acc, x) -> acc   size(split(text, x)) -1)")
).drop("word_list")

result.show(truncate=False)
# +---+---------------------+----------+
# |id |text                 |list_count|
# +---+---------------------+----------+
# |1  |Hello my name is John|6         |
# |2  |Yo go Bengals        |2         |
# |3  |this is a text       |2         |
# +---+---------------------+----------+
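
Since the question mentions that a Pandas-based answer would also be acceptable, here is a minimal Pandas sketch of the same logic. It treats each word as a literal substring (str.count counts non-overlapping occurrences), which matches the split-based counting above:

import pandas as pd

pdf = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['Hello my name is John', 'Yo go Bengals', 'this is a text'],
})
word_list = ['is', 'm', 'o', 'my']

# Sum the non-overlapping occurrences of every word in each row's text
pdf['list_count'] = pdf['text'].map(
    lambda t: sum(t.count(w) for w in word_list)
)
# list_count: 6, 2, 2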