Find word near other word, within N# of words

I need an enumerating regex function that identifies instances in a string when 'Word 1' is within N# words of 'Word 2'

For example, here is my dataframe and objective:

Pandas Dataframe Input

import pandas as pd

data = [['ABC123', 'This is the first example sentence the end of sentence one'], ['ABC456', 'This is the second example sentence one more sentence to come'], ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | This is the first example sentence the end of sentence one
ABC456    | This is the second example sentence one more sentence to come
ABC789    | There are no more example sentences

Word 1 = 'sentence'
Word 2 = 'the'
Within N# of words (displaced) = 3

Desired Dataframe Output

output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)

Record ID | Occurrences Identified
----------|-----------------------
ABC123    | 3
ABC456    | 1
ABC789    | 0

I think the regex part will take the general form below, but I'm not sure how to apply it to my use case in Python, and I'm not sure where to start with an enumerate function.

\b(?:'sentence'\W+(?:\w+\W+){0,3}?'the'|'the'\W+(?:\w+\W+){0,3}?'sentence')\b

I am also interested in simpler non-regex solutions, if any.

CodePudding user response：

Maybe regex is not the right solution here.

If you split your input string into a list, you can then locate the indices of words 1 and 2, and calculate how far away they are from each other:

string = 'This is the first example sentence the end of sentence one'
string_list = string.split(' ')
indices_word_1 = [i for i, x in enumerate(string_list) if x == "sentence"]
indices_word_2 = [i for i, x in enumerate(string_list) if x == "the"]
result = 0
for i in indices_word_1:
    for j in indices_word_2:
        _distance = abs(i - j)
        if _distance <= 3:
            result += 1

In this case the result is 3.
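
The same loop can be wrapped in a small helper and run over every record from the question. This is only a sketch: the name count_nearby_pairs and its parameters are made up for illustration, and matching is exact and case-sensitive on whitespace-separated tokens (so 'sentences' does not count as 'sentence').

```python
def count_nearby_pairs(text, word_1, word_2, max_distance):
    """Count (word_1, word_2) position pairs at most max_distance words apart."""
    words = text.split()
    positions_1 = [i for i, w in enumerate(words) if w == word_1]
    positions_2 = [i for i, w in enumerate(words) if w == word_2]
    # Compare every position of word_1 with every position of word_2.
    return sum(
        1
        for i in positions_1
        for j in positions_2
        if abs(i - j) <= max_distance
    )


data = [
    ['ABC123', 'This is the first example sentence the end of sentence one'],
    ['ABC456', 'This is the second example sentence one more sentence to come'],
    ['ABC789', 'There are no more example sentences'],
]

output_data = [
    [record_id, count_nearby_pairs(text, 'sentence', 'the', 3)]
    for record_id, text in data
]
print(output_data)  # [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
```

The resulting list has the same shape as output_data in the question, so it could be passed to pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified']).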

CodePudding user response：

I think you were very close to the solution. The ' characters in your regex match literal apostrophes, which is not what you want. If you remove them, you end up with a valid pattern:

>>> import re
>>> list(re.compile(r'\b(?:sentence\W+(?:\w+\W+){0,3}?the|the\W+(?:\w+\W+){0,3}?sentence)\b', re.I).finditer("This is the first example sentence the end of sentence one"))
[<_sre.SRE_Match object; span=(8, 34), match='the first example sentence'>,
 <_sre.SRE_Match object; span=(35, 54), match='the end of sentence'>]

Note that this finds only two occurrences because regex matches don't overlap. If you need overlapping results, you are probably better off using another solution.
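
One possible way to let matches overlap is to wrap the whole pattern in a lookahead with a capturing group, so that each match consumes no text. The following is a rough sketch, not part of the original answer: find_overlapping and its max_between parameter (the number of words allowed in between, like the {0,3} above) are hypothetical names. It reports at most one match per starting word, so it can still count fewer occurrences than an all-pairs comparison.

```python
import re


def find_overlapping(text, word_1, word_2, max_between=3):
    """Return one match per starting position; matches may overlap."""
    w1, w2 = re.escape(word_1), re.escape(word_2)
    # Lazily allow up to max_between intervening words.
    gap = r'\W+(?:\w+\W+){0,' + str(max_between) + r'}?'
    body = w1 + gap + w2 + '|' + w2 + gap + w1
    # The lookahead consumes nothing, so the next search can start
    # inside the text captured by the previous match.
    pattern = re.compile(r'(?=\b(' + body + r')\b)', re.I)
    return [m.group(1) for m in pattern.finditer(text)]


text = 'This is the first example sentence the end of sentence one'
matches = find_overlapping(text, 'sentence', 'the')
print(matches)
# ['the first example sentence', 'sentence the', 'the end of sentence']
print(len(matches))  # 3
```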

CodePudding user response：

You can use a positive lookahead (?=...) to check for the second word without consuming it, so the same word can still start the next match.

import re
text1 = 'This is the first example sentence the end of sentence one'
text2 = 'This is the second example sentence one more sentence to come'
text3 = 'There are no more example sentences'
regex = r'(?:the(?:\s\w+){0,2})\s(?=sentence)|(?:sentence(?:\s\w+){0,2})\s(?=the)'
data = re.findall(regex, text1)
print(data)
print(len(data))
# Output:
# ['the first example ', 'sentence ', 'the end of ']
# 3
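
The pattern above has no word boundaries, so 'the' can also match at the end of a longer word such as 'bathe'. As a hedged variation (the names loose, strict and tricky are only for illustration), adding \b around both words avoids that, and the same count can be taken for all three example texts:

```python
import re

# Pattern from the answer above (no word boundaries).
loose = r'(?:the(?:\s\w+){0,2})\s(?=sentence)|(?:sentence(?:\s\w+){0,2})\s(?=the)'
# Variation with \b so that only whole words match.
strict = r'\bthe(?:\s\w+){0,2}\s(?=sentence\b)|\bsentence(?:\s\w+){0,2}\s(?=the\b)'

texts = [
    'This is the first example sentence the end of sentence one',
    'This is the second example sentence one more sentence to come',
    'There are no more example sentences',
]

counts = [len(re.findall(strict, t)) for t in texts]
print(counts)  # [3, 1, 0]

# 'bathe' ends in 'the', which the loose pattern accepts.
tricky = 'We bathe sentence'
loose_count = len(re.findall(loose, tricky))
strict_count = len(re.findall(strict, tricky))
print(loose_count, strict_count)  # 1 0
```

Both patterns are case-sensitive and expect single whitespace characters between words, as in the original answer.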