Extract A Sentence Based on Specific Phrase

I have the following text:

text = "The following leagues were identified: sport: basketball league: N.B.A. sport: soccer league: E.P.L. sport: football league: N.F.L. The other leagues aren't that important."

I need to extract only the sentence that contains a specific phrase (e.g. "following leagues were identified"). If multiple sentences contain the phrase, the first one will suffice. I'm trying the following script:

import re

def extract_sent(text):
  text = text.lower()
  # [^.]* stops at the first period on either side of the phrase
  text = re.findall(r'([^.]*' 'leagues were identified' '[^.]*)', text)
  return text

extract_sent(text)
======================
['the following leagues were identified: sport: basketball league: n']

The code works as long as no word contains a period (.). If a word contains a period (e.g. N.B.A., Ref.), the sentence is cut off there. For example, the sentence with N.B.A. is cut off right after the N.

Desired output:

"The following leagues were identified: sport: basketball league: n.b.a. sport: soccer league: e.p.l. sport: football league: n.f.l."

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response：

Assuming a sentence break can be defined as whitespace that follows a dot and precedes a word starting with a capital letter, would you please try:

import re

text = "The following leagues were identified: sport: basketball league: N.B.A. sport: soccer league: E.P.L. sport: football league: N.F.L. The other leagues aren't that important."

m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text)
l = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))
print(l[0])     # print the first matched sentence

Output:

The following leagues were identified: sport: basketball league: N.B.A. sport: soccer league: E.P.L. sport: football league: N.F.L.

It generates a list of all the sentences that contain the phrase, so you can use as many of them as you like.
The definition of the sentence break as noted above will work for the provided example but may be fragile for actual texts. We will need to tweak the regex based on actual samples.
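
If only the first match is needed, the same idea can be wrapped in a small helper that stops at the first hit and returns None when nothing matches (indexing an empty list with l[0] would raise IndexError). This is only a sketch under the same sentence-break assumption as above; the names SENTENCE_BREAK and first_sentence_with are made up for illustration.

```python
import re

# Assumption (same as above): a sentence ends at a period followed by
# whitespace and a capitalised word of at least two characters.
SENTENCE_BREAK = re.compile(r'(?<=\.)\s+(?=[A-Z]\w+)')

def first_sentence_with(text, phrase):
    """Return the first sentence containing `phrase` (case-insensitive), or None."""
    needle = phrase.lower()
    for sentence in SENTENCE_BREAK.split(text):
        if needle in sentence.lower():
            return sentence
    return None

text = "The following leagues were identified: sport: basketball league: N.B.A. sport: soccer league: E.P.L. sport: football league: N.F.L. The other leagues aren't that important."

print(first_sentence_with(text, "leagues were identified"))
```

Note that the lookahead needs a capitalised word of two or more characters, so a sentence starting with a one-letter word such as "A" or "I" would not be split off.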

CodePudding user response：

My little contribution:

If you have control over the text and your font supports it, an alternative would be to use a different Unicode character for the dots in the acronyms (e.g. U+2024 ONE DOT LEADER, https://www.fileformat.info/info/unicode/char/2024/index.htm).

If the data comes from another source, then since you are already preprocessing it (the lowercase conversion), you could first replace the "acronym dots" with U+2024 as above, or with any other unused temporary character. For this you will need a list of those acronyms, as there is no magic here short of NLP, and that's another level of complexity.

Also, re.search (instead of re.findall) is enough if you're only looking for the first occurrence.
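
A rough sketch of both ideas together might look like the following. The acronym list and the helper names (ACRONYMS, protect_acronyms, extract_sent) are placeholders for illustration. One extra assumption is needed: when an acronym is followed by whitespace and a capital letter (or by the end of the text), its last dot is treated as the end of the sentence and left untouched, otherwise "N.F.L. The other..." would no longer have a sentence break at all.

```python
import re

ONE_DOT_LEADER = "\u2024"  # U+2024, temporary stand-in for dots inside acronyms

# Hypothetical list; in practice it has to cover the acronyms in your data.
ACRONYMS = ["N.B.A.", "E.P.L.", "N.F.L."]
ACRONYM_RE = re.compile("|".join(re.escape(a) for a in ACRONYMS))

# Assumption: an acronym followed by whitespace + capital letter, or by the
# end of the text, also closes the sentence.
ENDS_SENTENCE = re.compile(r"\s+[A-Z]|\s*$")

def protect_acronyms(text):
    """Replace the dots of known acronyms with ONE_DOT_LEADER."""
    def swap(match):
        masked = match.group().replace(".", ONE_DOT_LEADER)
        if ENDS_SENTENCE.match(text, match.end()):
            masked = masked[:-1] + "."  # keep the real sentence-ending dot
        return masked
    return ACRONYM_RE.sub(swap, text)

def extract_sent(text, phrase="leagues were identified"):
    """Return the first sentence containing `phrase`, or None."""
    protected = protect_acronyms(text)
    pattern = r"[^.]*" + re.escape(phrase) + r"[^.]*\.?"
    m = re.search(pattern, protected, flags=re.IGNORECASE)  # first match only
    if m is None:
        return None
    # Put the ordinary dots back before returning the sentence.
    return m.group().strip().replace(ONE_DOT_LEADER, ".")

text = "The following leagues were identified: sport: basketball league: N.B.A. sport: soccer league: E.P.L. sport: football league: N.F.L. The other leagues aren't that important."

print(extract_sent(text))
```

Using re.IGNORECASE instead of text.lower() keeps the original capitalisation in the result; call .lower() on the returned sentence if you want the lowercase form.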