I have a list of strings. I want to extract from each the first sentence starting with "I am". For example:
"I am OK. How about you?"
-> "I am OK."
"I think I am better than that."
-> "I am better than that."
I'm using a very simple function that works most of the time:
def extract_sentence(text):
    start_idx = text.find("I am")
    dot_idx = text[start_idx:].find(".")
    return text[start_idx:start_idx + dot_idx + 1]
But it won't work in some instances. For example:
"I am enjoying https://stackoverflow.com/. People are very helpful there."
-> "I am enjoying https://stackoverflow."
(should have been "I am enjoying https://stackoverflow.com/.")
"I am worried about Josh"
-> ""
(no ending dot, should have been a copy of the input, i.e. "I am worried about Josh").
I believe some regex would help me here, but I don't know how to start. Any ideas?
CodePudding user response:
You could use a positive lookahead to end the match at a period that is followed by whitespace, or the end of the line. Consider the following regex:
I am .*?(?:\.(?=\s)|$)
Explanation:
I am : Literally "I am "
.*? : Zero or more of any character (except a newline), lazy match
(?: ) : Non-capturing group
\. : A literal period
(?= ) : Positive lookahead
\s : Whitespace
|$ : End of string / line
If you want to allow multiple characters to end a sentence, modify the regex to match those instead of only a period: I am .*?(?:[.!?](?=\s)|$)
In python:
import re
# One test string per line
strings = """I am OK. How about you?
I think I am better than that.
I am enjoing https://stackoverflow.com/. People are very helpful there.
I am worried about Josh""".split('\n')
# Stop at a period followed by whitespace, or at the end of the string
regex = re.compile(r"I am .*?(?:\.(?=\s)|$)")
for s in strings:
    print("Input string:\n" + repr(s))
    m = regex.findall(s)
    if m:
        # findall returns every match; the first one is the sentence we want
        print("Matched string:\n" + repr(m[0]))
    else:
        print("No match")
    print("")
which gives the expected output:
Input string:
'I am OK. How about you?'
Matched string:
'I am OK.'
Input string:
'I think I am better than that.'
Matched string:
'I am better than that.'
Input string:
'I am enjoing https://stackoverflow.com/. People are very helpful there.'
Matched string:
'I am enjoing https://stackoverflow.com/.'
Input string:
'I am worried about Josh'
Matched string:
'I am worried about Josh'
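As a minimal sketch of the multi-terminator variant mentioned earlier, the pattern can be wrapped in a drop-in replacement for the question's function. The name SENTENCE_RE and the choice to return None when there is no "I am " are assumptions for illustration, not part of the original answer:

```python
import re

# Hypothetical module-level name; the pattern is the [.!?] variant of the regex.
SENTENCE_RE = re.compile(r"I am .*?(?:[.!?](?=\s)|$)")


def extract_sentence(text):
    """Return the first sentence starting with "I am ", or None if there is none."""
    match = SENTENCE_RE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in [
        "I am so happy! Are you?",
        "I am at v1.2 now. Next.",
        "I am worried about Josh",
        "No match here.",
    ]:
        print(repr(sample), "->", repr(extract_sentence(sample)))
```

Using re.search instead of findall stops at the first match, which is all the question asks for.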
CodePudding user response:
A very simple solution is to match everything starting at I am, up to but not including any . followed by whitespace (since you want to keep the periods in a URL):
import re
# Raw strings keep the backslashes in the pattern from being read as string escapes
text = "I think I am enjoying https://stackoverflow.com/. People are very helpful there."
print(re.search(r'I am(?:[^\.]|\.(?!\s))*', text).group(0))
text = "I am happy. I am satisfied."
print(re.search(r'I am(?:[^\.]|\.(?!\s))*', text).group(0))
Output:
I am enjoying https://stackoverflow.com/
I am happy
The regex works because it matches anything starting with I am, followed by zero or more repetitions (*) of either a character that's not a period ([^\.]) or (|) a period that's not followed by whitespace (\.(?!\s)). The alternation is wrapped in parentheses, with ?: at the start to make the group non-capturing, since its contents don't need to be captured separately (slightly more efficient).
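Note that this pattern stops just before the period that ends the sentence, whereas the question's expected output keeps it. One possible tweak, sketched here as an assumption rather than as part of the original answer, is to append an optional \.? after the group. The helper name first_i_am_sentence is hypothetical:

```python
import re

# Same alternation as above, followed by an optional closing period (assumed tweak).
PATTERN = re.compile(r"I am(?:[^.]|\.(?!\s))*\.?")


def first_i_am_sentence(text):
    """Return the first "I am" sentence, including its final period if present."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(first_i_am_sentence(
        "I think I am enjoying https://stackoverflow.com/. People are very helpful there."
    ))
    print(first_i_am_sentence("I am happy. I am satisfied."))
    print(first_i_am_sentence("I am worried about Josh"))
```

The group is greedy, so it runs up to the first period that is followed by whitespace, and \.? then consumes that period if there is one.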