I have a PySpark DataFrame with a string column text
and a separate list word_list
and I need to count how many of the word_list
values appear in each text
row (can be counted more than once).
df = spark.createDataFrame(
[(1,'Hello my name is John'),
(2,'Yo go Bengals'),
(3,'this is a text')
]
, ['id','text']
)
word_list = ['is', 'm', 'o', 'my']
The result would be:
| text | list_count |
| Hello my name is John | 6 |
| Yo go Bengals | 2 |
| this is a text | 2 |
For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from word_list
that appears is "o" and it appears twice. In the third value for text
, the only value from word_list
that appears is "is" and it appears twice.
The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.
CodePudding user response:
You can do this with a UDF as below
UDF
df = sql.createDataFrame(
[(1,'Hello my name is John'),
(2,'Yo go Bengals'),
(3,'this is a text')
]
, ['id','text']
)
word_list = ['is', 'm', 'o', 'my']
def count_values(inp,map_list=None):
count = 0
for pattern in map_list:
if re.findall(pattern,inp):
count = 1
return count
count_values_udf = F.udf(partial(count_values,map_list=word_list),IntegerType())
df.select(
count_values_udf(F.col('text')).alias('count_values')
).show()
------------
|count_values|
------------
| 4|
| 1|
| 1|
------------
CodePudding user response:
To count number of occurrences of a substring in string column, you can split that column by the substring. The count corresponds to the size of resulting array minus 1.
So in your case, you can use aggregate
function on the word_list
array column and for each element, split the text
column and get the size - 1
:
from pyspark.sql import functions as F
result = df.withColumn(
"word_list",
F.array(*[F.lit(x) for x in word_list])
).withColumn(
"list_count",
F.expr("aggregate(word_list, 0, (acc, x) -> acc size(split(text, x)) -1)")
).drop("word_list")
result.show(truncate=False)
# --- --------------------- ----------
#|id |text |list_count|
# --- --------------------- ----------
#|1 |Hello my name is John|6 |
#|2 |Yo go Bengals |2 |
#|3 |this is a text |2 |
# --- --------------------- ----------