I would like to see if a string column is contained in another column as a whole word. There are few approaches like using contains
as described here or using array_contains
as described here.
The first approach fails in the following edge case:
--------- -----------------------
|candidate| sentence |
--------- -----------------------
| su |We saw the survivors. |
--------- -----------------------
su
should be found as a separate word and not as a pure substring of the sentence
column.
The second approach fails when the candidate is a compound word. An example is:
---------------- ------------------------
|candidate | sentence |
---------------- ------------------------
| Roman emperor | He was a Roman emperor.|
---------------- ------------------------
The second approach fails here because it turns the sentence column to an array of tokens: [He, was, a, Roman, emperor]
and none of them is equal to Roman emperor
.
Is there any way to resolve this issue?
CodePudding user response:
This probably still has edge cases but I hope you get some ideas.
I would use regex_extract
to match the candidate against the sentence.
First, I convert the candidate to regex (ie, convert space to \s), then use regex_extract
with word boundary (\b).
df = (df.withColumn('regex', F.regexp_replace(F.col('candidate'), ' ', '\\\s'))
.withColumn('match', F.expr(r"regexp_extract(sentence, concat('\\b', regex, '\\b'), 0)")))
Result
------------- ----------------------- -------------- -------------
| candidate| sentence| regex| match|
------------- ----------------------- -------------- -------------
| su| We saw the survivors.| su| |
|Roman emperor|He was a Roman emperor.|Roman\semperor|Roman emperor|
------------- ----------------------- -------------- -------------