Find all sentences containing specific words

Time:04-24

I have a string consisting of sentences and want to find all sentences that contain at least one specific keyword, i.e. keyword1 or keyword2:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('keyword2 is inside this sentence. ', '', 'keyword2')

Expected Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('And keyword2 is inside this sentence. ', '', 'keyword2')

As you can see, the second match doesn't contain the whole sentence in the first group. What am I missing here?

CodePudding user response:

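You can use a negated character class, [^.!?], so that a match cannot run past the end of a sentence, and wrap the keyword alternation in a non-capturing group, (?:...), so that the | applies only to the keywords instead of splitting the whole of group 1 into two alternatives. (If you also want to avoid the empty string in the result, put both keywords in a single group, (keyword1|keyword2), instead of giving each its own group.)

re.findall then returns a tuple of the capture group values for each match: group 1 holds the whole sentence, and groups 2 and 3 hold whichever keyword matched (the other one is an empty string).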

([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s

Explanation

  • ( Capture group 1
    • [A-Z][^.!?]* Match an uppercase char A-Z, then zero or more chars other than . ! ?
    • (?:(keyword1)|(keyword2)) Non-capturing group that limits the alternation; each keyword is captured in its own group (2 or 3)
    • [^.!?]*[.!?] Match zero or more chars other than . ! ? and then match one of . ! ?
  • ) Close group 1
  • \s Match a whitespace char (outside group 1, so it is not part of the captured sentence)
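
To make the effect of the non-capturing group more concrete, here is a small sketch that runs the question's pattern and the pattern above side by side and prints where group 1 starts. The names unscoped and scoped are only illustrative labels, not part of either pattern.

```python
import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

# Question's pattern: the top-level "|" splits group 1 into two whole
# alternatives, so the second one may begin directly at "keyword2".
unscoped = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
# Pattern from this answer: the "|" is confined to the (?:...) group.
scoped = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

for name, pattern in (("unscoped", unscoped), ("scoped", scoped)):
    for m in pattern.finditer(s):
        # m.start(1) is the index at which the captured text begins.
        print(name, m.start(1), repr(m.group(1)))
```

With the scoped pattern, every captured sentence has to start at an uppercase letter, because the [A-Z] prefix is no longer part of just one alternative.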


Example

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")
for match in pattern.findall(s):
    print(match)

Output

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
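
If the empty strings in the tuples are unwanted, or the last sentence is not followed by whitespace, one possible variation is sketched below. It assumes a hypothetical helper find_sentences (not part of the answer above) that puts all keywords in a single capture group, escapes them with re.escape, and replaces the trailing \s with a lookahead that also accepts the end of the string.

```python
import re

def find_sentences(text, keywords):
    """Return (sentence, keyword) pairs for sentences containing a keyword.

    Hypothetical helper for illustration; it keeps the answer's assumption
    that a sentence starts with A-Z and ends with ".", "!" or "?".
    """
    # re.escape guards against keywords that contain regex metacharacters.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        r"([A-Z][^.!?]*(" + alternation + r")[^.!?]*[.!?])(?=\s|$)"
    )
    # Two capture groups, so findall returns 2-tuples without empty slots.
    return pattern.findall(text)

# Note: no trailing space after the last sentence here.
s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence."
pairs = find_sentences(s, ["keyword1", "keyword2"])
for sentence, keyword in pairs:
    print(keyword, "->", sentence)
```

Because both keywords share one group, each result is a (sentence, keyword) pair, which is usually easier to unpack than a tuple with one slot per keyword.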

CodePudding user response:

You can try the following:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
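
One caveat worth checking before relying on this pattern: the leading .* is not restricted to a single sentence, so it can absorb earlier sentences that contain no keyword. The sketch below uses an assumed test string (not from the question) and compares this pattern with the negated-character-class pattern from the other answer; dot_star and negated are illustrative names.

```python
import re

# Assumed test string: the first sentence contains no keyword.
s = "Nothing to see here. But keyword1 is inside this sentence. "

dot_star = re.compile(r"[.?!]*\s*(.*(keyword1)[^.?!]*[.?!]|.*(keyword2)[^.?!]*[.?!])")
negated = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

dot_star_hits = [m.group(1) for m in dot_star.finditer(s)]
negated_hits = [m.group(1) for m in negated.finditer(s)]

print(dot_star_hits)  # the capture also includes the keyword-free first sentence
print(negated_hits)   # only the sentence that contains keyword1
```

For input like the question's, where every sentence contains a keyword, both patterns give the same sentences; they differ only when a keyword-free sentence comes first.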