How to extract a sentence with start marker?

I have a list of strings. I want to extract from each the first sentence starting with "I am". For example:

"I am OK. How about you?" -> "I am OK."

"I think I am better than that." -> "I am better than that."

I'm using a very simple function that works most of the time:

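def extract_sentence(text):
    # Position of the first "I am" (-1 if absent)
    start_idx = text.find("I am")
    # Offset of the first period at or after that position
    dot_idx = text[start_idx:].find(".")

    return text[start_idx:start_idx + dot_idx + 1]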

But it won't work in some instances. For example:

"I am enjoying https://stackoverflow.com/. People are very helpful there." -> "I am enjoying https://stackoverflow." (should have been "I am enjoying https://stackoverflow.com/.")

"I am worried about Josh" -> "" (no ending dot, should have been a copy of the input, i.e. "I am worried about Josh").

I believe some regex would help me here, but I don't know how to start. Any ideas?

CodePudding user response：

You could use a positive lookahead to end the match at a period that is followed by whitespace, or the end of the line. Consider the following regex:

I am .*?(?:\.(?=\s)|$) 
                      
Explanation:          
I am                   : Literally "I am "
     .*?               : Zero or more of any character, lazy match
        (?:          ) : Non-capturing group
           \.          : A literal period
             (?=  )    : Positive lookahead
                \s     : Whitespace
                   |$  : End of string / line
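
The same breakdown can be written directly in Python with the re.VERBOSE flag, so that each piece of the pattern carries its own comment. This is only a sketch of the explanation above; the name SENTENCE_RE is made up for illustration. In verbose mode unescaped whitespace in the pattern is ignored, so the literal spaces in "I am " are written as [ ].

```python
import re

# Hypothetical name; same pattern as above, spelled out with re.VERBOSE.
SENTENCE_RE = re.compile(
    r"""
    I[ ]am[ ]      # literally "I am " (spaces must be explicit in verbose mode)
    .*?            # zero or more of any character except newline, lazy
    (?:            # non-capturing group: how the sentence may end
        \.(?=\s)   #   a literal period followed by whitespace
      | $          #   or the end of the string
    )
    """,
    re.VERBOSE,
)

match = SENTENCE_RE.search("I think I am better than that.")
print(match.group(0) if match else None)  # I am better than that.
```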


If you want to allow multiple characters to end a sentence, modify the regex to match those instead of only a period: I am .*?(?:[.!?](?=\s)|$)
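
As a quick check of that variant, here is a small sketch; the sample strings are invented for illustration and are not from the question.

```python
import re

# Same idea as above, but '.', '!' and '?' may all end the sentence.
pattern = re.compile(r"I am .*?(?:[.!?](?=\s)|$)")

samples = [
    "I am late! Sorry about that.",
    "I am right? I think so.",
    "I am fine",
]

results = []
for s in samples:
    m = pattern.search(s)
    results.append(m.group(0) if m else None)

print(results)  # ['I am late!', 'I am right?', 'I am fine']
```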

In Python:

import re

strings = """I am OK. How about you?
I think I am better than that.
I am enjoying https://stackoverflow.com/. People are very helpful there.
I am worried about Josh""".split('\n')

regex = re.compile(r"I am .*?(?:\.(?=\s)|$)")

for s in strings:
    print("Input string:\n" + repr(s))
    # findall returns every match; the first one is the sentence we want
    m = regex.findall(s)
    if m:
        print("Matched string:\n" + repr(m[0]))
    else:
        print("No match")
    print("")

which gives the expected output:

Input string:
'I am OK. How about you?'
Matched string:
'I am OK.'

Input string:
'I think I am better than that.'
Matched string:
'I am better than that.'

Input string:
'I am enjoying https://stackoverflow.com/. People are very helpful there.'
Matched string:
'I am enjoying https://stackoverflow.com/.'

Input string:
'I am worried about Josh'
Matched string:
'I am worried about Josh'
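
To plug this into the original workflow, the regex could be wrapped in a drop-in replacement for the question's extract_sentence function. This is a minimal sketch: returning None when no "I am " is found is an assumption (the question does not say what should happen then), and it assumes single-line strings, since . does not match a newline by default.

```python
import re
from typing import Optional

# Compiled once and reused for every string in the list.
_I_AM_SENTENCE = re.compile(r"I am .*?(?:\.(?=\s)|$)")


def extract_sentence(text: str) -> Optional[str]:
    """Return the first sentence starting with "I am ", or None if there is none."""
    m = _I_AM_SENTENCE.search(text)
    return m.group(0) if m else None


texts = [
    "I am OK. How about you?",
    "I am worried about Josh",
    "Nothing to see here.",
]
print([extract_sentence(t) for t in texts])
# ['I am OK.', 'I am worried about Josh', None]
```

Using search instead of findall stops at the first match, which is all the question asks for.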

CodePudding user response：

A very simple solution is to match everything starting at "I am" up to, but not including, the first period that is followed by whitespace (this keeps the periods inside a URL):

import re

# Raw strings keep \. and \s from being treated as string escapes
text = "I think I am enjoying https://stackoverflow.com/. People are very helpful there."
print(re.search(r'I am(?:[^\.]|\.(?!\s))*', text).group(0))

text = "I am happy. I am satisfied."
print(re.search(r'I am(?:[^\.]|\.(?!\s))*', text).group(0))

Output:

I am enjoying https://stackoverflow.com/
I am happy

The regex matches "I am" followed by zero or more repetitions (*) of either a character that is not a period ([^\.]) or (|) a period that is not followed by whitespace (\.(?!\s)). The alternation is wrapped in a group that starts with ?:, which makes it non-capturing: the group is needed only to apply * to the whole alternation, not to extract its contents. Matching therefore stops just before the first period that is followed by whitespace, so that period is not part of the result.
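
If the terminating period should be part of the result, as in the question's examples, one possible tweak is to append an optional \.? to the pattern. This is a sketch of a variation, not part of the answer above, and the names WITH_PERIOD and first_i_am are made up for illustration.

```python
import re

# Same alternation as above, plus an optional trailing period.
# The repetition stops in front of a period followed by whitespace;
# \.? then consumes that period if it is there.
WITH_PERIOD = re.compile(r"I am(?:[^.]|\.(?!\s))*\.?")


def first_i_am(text):
    """Return the first "I am" sentence including its final period, or None."""
    m = WITH_PERIOD.search(text)
    return m.group(0) if m else None


print(first_i_am("I am happy. I am satisfied."))  # I am happy.
print(first_i_am("I am worried about Josh"))      # I am worried about Josh
```

Unlike ., the class [^.] also matches newlines, so on multi-line input this pattern keeps going across line breaks until it finds a period followed by whitespace.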