I need a regex-based function that counts the instances in a string where 'Word 1' is within N words of 'Word 2'.
For example, here is my dataframe and objective:
Pandas Dataframe Input
import pandas as pd
data = [['ABC123', 'This is the first example sentence the end of sentence one'], ['ABC456', 'This is the second example sentence one more sentence to come'], ['ABC789', 'There are no more example sentences']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)
Record ID | String
----------|-----------------------
ABC123 | This is the first example sentence the end of sentence one
ABC456 | This is the second example sentence one more sentence to come
ABC789 | There are no more example sentences
Word 1 = 'sentence'
Word 2 = 'the'
Within N# of words (displaced) = 3
Desired Dataframe Output
output_data = [['ABC123', 3], ['ABC456', 1], ['ABC789', 0]]
df = pd.DataFrame(output_data, columns=['Record ID', 'Occurrences Identified'])
print(df)
Record ID | Occurrences Identified
----------|-----------------------
ABC123 | 3
ABC456 | 1
ABC789 | 0
I think the regex will take the general form below, but I'm not sure how to apply it to my use case in Python, or where to start with an enumerate function.
\b(?:'sentence'\W+(?:\w+\W+){0,3}?'the'|'the'\W+(?:\w+\W+){0,3}?'sentence')\b
I am also interested in simpler non-regex solutions, if any.
CodePudding user response:
Maybe regex is not the right solution here.
If you split your input string into a list, you can then locate the indices of words 1 and 2, and calculate how far away they are from each other:
string = 'This is the first example sentence the end of sentence one'
string_list = string.split(' ')
indices_word_1 = [i for i, x in enumerate(string_list) if x == "sentence"]
indices_word_2 = [i for i, x in enumerate(string_list) if x == "the"]
result = 0
for i in indices_word_1:
    for j in indices_word_2:
        _distance = abs(i - j)
        if _distance <= 3:
            result += 1  # count every pair that is at most 3 words apart
In this case the result is 3.
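To get one count per record, the same idea can be wrapped in a function. This is only a sketch: the name count_pairs_within and its parameters are made up for illustration, and it assumes whitespace-separated words with exact, case-sensitive matches.

```python
def count_pairs_within(text, word_1, word_2, max_distance):
    """Count (word_1, word_2) pairs that are at most max_distance words apart."""
    words = text.split()
    indices_1 = [i for i, w in enumerate(words) if w == word_1]
    indices_2 = [i for i, w in enumerate(words) if w == word_2]
    # Compare every occurrence of word_1 with every occurrence of word_2.
    return sum(
        1
        for i in indices_1
        for j in indices_2
        if abs(i - j) <= max_distance
    )


strings = [
    'This is the first example sentence the end of sentence one',
    'This is the second example sentence one more sentence to come',
    'There are no more example sentences',
]
counts = [count_pairs_within(s, 'sentence', 'the', 3) for s in strings]
print(counts)  # [3, 1, 0]
```

With the dataframe from the question, something like df['String'].apply(lambda s: count_pairs_within(s, 'sentence', 'the', 3)) should give the 'Occurrences Identified' column.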
CodePudding user response:
I think you were very close to the solution. The ' characters in your regex match literal apostrophes, but you don't want to match apostrophes. If you remove them, you end up with a valid pattern:
>>> import re
>>> list(re.compile(r'\b(?:sentence\W+(?:\w+\W+){0,3}?the|the\W+(?:\w+\W+){0,3}?sentence)\b', re.I).finditer("This is the first example sentence the end of sentence one"))
[<_sre.SRE_Match object; span=(8, 34), match='the first example sentence'>,
<_sre.SRE_Match object; span=(35, 54), match='the end of sentence'>]
Note that this finds only two occurrences, because regex matches don't overlap. If you need overlapping results, you are probably better off with another solution.
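If overlapping results are needed and you still want a regex, one possible workaround is to put the whole pattern inside a lookahead, so that each match is zero-width and the next search can start inside the previous hit. The sketch below assumes that "within N words" means at most N - 1 words in between; find_overlapping is a hypothetical helper name, not something from the re module.

```python
import re


def find_overlapping(text, word_1, word_2, max_distance):
    """Return overlapping hits where the two words are at most max_distance words apart."""
    gap = max(max_distance - 1, 0)  # words allowed between the two targets
    w1, w2 = re.escape(word_1), re.escape(word_2)
    # The lookahead consumes nothing, so a hit may start inside an earlier hit.
    pattern = re.compile(
        rf"\b(?=((?:{w1}\W+(?:\w+\W+){{0,{gap}}}?{w2}"
        rf"|{w2}\W+(?:\w+\W+){{0,{gap}}}?{w1})\b))",
        re.IGNORECASE,
    )
    return [m.group(1) for m in pattern.finditer(text)]


text = "This is the first example sentence the end of sentence one"
hits = find_overlapping(text, "sentence", "the", 3)
print(hits)
# ['the first example sentence', 'sentence the', 'the end of sentence']
```

This reports at most one hit per starting word (the nearest partner that follows it), so it can still undercount compared with comparing every pair of indices.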
CodePudding user response:
You can use a positive lookahead, (?=...), to validate the second word without consuming it.
import re
text1 = 'This is the first example sentence the end of sentence one'
text2 = 'This is the second example sentence one more sentence to come'
text3 = 'There are no more example sentences'
regex = r'(?:the(?:\s\w+){0,2})\s(?=sentence)|(?:sentence(?:\s\w+){0,2})\s(?=the)'
data = re.findall(regex, text1)
print(data)
print(len(data))
>>> ['the first example ', 'sentence ', 'the end of ']
>>> 3
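To check the same regex against all three records from the question, it can be run over each string in turn. This is a small sketch: the records dictionary is just the question's data keyed by Record ID, and the regex is written out again so the snippet runs on its own.

```python
import re

# The three records from the question, keyed by Record ID.
records = {
    'ABC123': 'This is the first example sentence the end of sentence one',
    'ABC456': 'This is the second example sentence one more sentence to come',
    'ABC789': 'There are no more example sentences',
}
regex = r'(?:the(?:\s\w+){0,2})\s(?=sentence)|(?:sentence(?:\s\w+){0,2})\s(?=the)'

# re.findall returns every non-overlapping match, so its length is the count.
counts = {record_id: len(re.findall(regex, text)) for record_id, text in records.items()}
print(counts)  # {'ABC123': 3, 'ABC456': 1, 'ABC789': 0}
```

On a dataframe, df['String'].str.count(regex) should give the same numbers, since it also counts non-overlapping matches per row.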