Find word near other word, within N# of words

Time:01-11

I need an enumerating regex function that identifies instances in a string when 'Word 1' is within N# words of 'Word 2'

For example, here is my dataframe and objective:

Pandas Dataframe Input

import pandas as pd

data = [['ABC123', 'This is the first example sentence the end of sentence one'], ['ABC456', 'This is the second example sentence one more sentence to come'], ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | This is the first example sentence the end of sentence one
ABC456    | This is the second example sentence one more sentence to come
ABC789    | There are no more example sentences

Word 1 = 'sentence'
Word 2 = 'the'
Within N# of words (displaced) = 3

Desired Dataframe Output

output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)

Record ID | Occurrences Identified
----------|-----------------------
ABC123    | 3
ABC456    | 1
ABC789    | 0

I think the regex part will take the general form below, but I'm not sure how to apply it to my use case in Python, and I'm not sure where to start with a function that counts the matches.

\b(?:'sentence'\W+(?:\w+\W+){0,3}?'the'|'the'\W+(?:\w+\W+){0,3}?'sentence')\b

I am also interested in simpler non-regex solutions, if any.

CodePudding user response:

Maybe regex is not the right solution here.

If you split your input string into a list, you can then locate the indices of words 1 and 2, and calculate how far away they are from each other:

string = 'This is the first example sentence the end of sentence one'
string_list = string.split(' ')
indices_word_1 = [i for i, x in enumerate(string_list) if x == "sentence"]
indices_word_2 = [i for i, x in enumerate(string_list) if x == "the"]
result = 0
for i in indices_word_1:
    for j in indices_word_2:
        _distance = abs(i - j)
        if _distance <= 3:
            result += 1

In this case the result is 3.
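
The same index-distance idea can be wrapped in a reusable function. The sketch below is one possible version, not the answerer's code: the name `count_near`, the lowercasing, and the `\w+` tokenization (used instead of `split(' ')` so that punctuation does not stick to words) are assumptions added for illustration.

```python
import re

def count_near(text, word_1, word_2, max_distance):
    """Count (word_1, word_2) pairs that are at most max_distance words apart."""
    # Tokenize on runs of word characters and compare case-insensitively.
    words = re.findall(r"\w+", text.lower())
    positions_1 = [i for i, w in enumerate(words) if w == word_1.lower()]
    positions_2 = [i for i, w in enumerate(words) if w == word_2.lower()]
    # Every qualifying pair is counted; i != j only matters if both words are equal.
    return sum(
        1
        for i in positions_1
        for j in positions_2
        if i != j and abs(i - j) <= max_distance
    )

rows = [
    ("ABC123", "This is the first example sentence the end of sentence one"),
    ("ABC456", "This is the second example sentence one more sentence to come"),
    ("ABC789", "There are no more example sentences"),
]
counts = {record_id: count_near(text, "sentence", "the", 3) for record_id, text in rows}
print(counts)  # {'ABC123': 3, 'ABC456': 1, 'ABC789': 0}
```

With the question's dataframe, a line such as `df['Occurrences Identified'] = df['String'].apply(lambda s: count_near(s, 'sentence', 'the', 3))` should then give the desired column.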

CodePudding user response:

I think you were very close to the solution. The ' characters in your regex match literal apostrophes, which is not what you want. If you remove them, you end up with a valid pattern:

>>> import re
>>> list(re.compile(r'\b(?:sentence\W+(?:\w+\W+){0,3}?the|the\W+(?:\w+\W+){0,3}?sentence)\b', re.I).finditer("This is the first example sentence the end of sentence one"))
[<_sre.SRE_Match object; span=(8, 34), match='the first example sentence'>,
 <_sre.SRE_Match object; span=(35, 54), match='the end of sentence'>]

Note that this finds only two occurrences, because regex matches don't overlap. If you need overlapping results, you are probably better off with another solution.
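
One common workaround for the non-overlap limitation is to wrap the whole pattern in a zero-width lookahead, so that a match can start at every word. The sketch below is an illustration under stated assumptions, not part of the answer above: the function name `count_near_regex` is hypothetical, and "within N words" is read as "at most N - 1 words in between", which is why the quantifier is `{0,N-1}`. It counts at most one match per starting word, so it can undercount compared with comparing every pair of positions.

```python
import re

def count_near_regex(text, word_1, word_2, max_distance):
    """Count starting words that have the other word within max_distance words after them."""
    if max_distance < 1:
        return 0
    gap = max_distance - 1  # number of words allowed in between (assumption)
    w1, w2 = re.escape(word_1), re.escape(word_2)
    filler = rf"(?:\w+\W+){{0,{gap}}}?"
    # The lookahead consumes nothing, so matches starting at different words may overlap.
    pattern = re.compile(
        rf"(?=\b({w1}\W+{filler}{w2}|{w2}\W+{filler}{w1})\b)",
        re.IGNORECASE,
    )
    return sum(1 for _ in pattern.finditer(text))

texts = [
    "This is the first example sentence the end of sentence one",
    "This is the second example sentence one more sentence to come",
    "There are no more example sentences",
]
print([count_near_regex(t, "sentence", "the", 3) for t in texts])  # [3, 1, 0]
```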

CodePudding user response:

You can use a positive lookahead, (?=...), to check for the second word without consuming it, so that it remains available as the start of the next match.

import re
text1 = 'This is the first example sentence the end of sentence one'
text2 = 'This is the second example sentence one more sentence to come'
text3 = 'There are no more example sentences'
regex = r'(?:the(?:\s\w+){0,2})\s(?=sentence)|(?:sentence(?:\s\w+){0,2})\s(?=the)'
data = re.findall(regex, text1)
print(data)
print(len(data))
# Output:
# ['the first example ', 'sentence ', 'the end of ']
# 3