How to extract a sentence containing a certain word without missing periods

Time:10-11

Using RegEx I am trying to extract all sentences from an article containing the word "Figure" and I have this:

((?<=^|\s)[A-Za-z0-9][^!?.]*(Figure)[^.]*(?=\.|\!|))

For the sentence "This effect (Smith et al., 2008) was seen in 0.0001% of samples (Figure 1b).", this returns only "of samples (Figure 1b)".

How can I modify my pattern so that it also captures the decimals and references, and so returns the entire sentence from start to end?

CodePudding user response:

Assuming that your sentences are well formed (which they should be, given that they appear to come from a scientific journal) and always start with a capital letter preceded by whitespace or the beginning of the string, you can use this regex:

(?:^|(?<=[.!?]\s))(?=[A-Z])(?:[^.!?]|[.!?](?!$|\s[A-Z]))*Figure.*?[.!?](?=$|\s[A-Z])

This matches:

  • (?:^|(?<=[.!?]\s)) : either start of string or a lookbehind that asserts a ., ?, or ! followed by a whitespace character
  • (?=[A-Z]) : a lookahead asserting a capital letter (we use a lookahead here so we can match Figure if it's the first word in the sentence)
  • (?:[^.!?]|[.!?](?!$|\s[A-Z]))* : any number of characters, each either a non-sentence-ending character or a sentence-ending character that is not followed by end-of-string or by a space and a capital letter (this is what lets the match run through "al.," and "0.0001")
  • Figure : the word Figure
  • .*?[.!?] : a minimal number of characters followed by a sentence ending character
  • (?=$|\s[A-Z]) : a lookahead that asserts either end of string or a space and a capital letter (i.e. the start of a new sentence)

Regex demo on regex101
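
As a rough sketch of how the pattern could be applied in code, here it is with Python's re module. The pattern is the one above, unchanged; the helper name sentences_with_figure and the sample article text are made up for illustration, and the first and last sentences are hypothetical filler around the sentence from the question.

```python
import re

# Pattern from the answer above, copied unchanged.
SENTENCE_WITH_FIGURE = re.compile(
    r"(?:^|(?<=[.!?]\s))(?=[A-Z])(?:[^.!?]|[.!?](?!$|\s[A-Z]))*Figure.*?[.!?](?=$|\s[A-Z])"
)


def sentences_with_figure(text):
    """Return every sentence in `text` that contains the word "Figure".

    Hypothetical helper, not part of any library.
    """
    return [m.group(0) for m in SENTENCE_WITH_FIGURE.finditer(text)]


# Hypothetical sample text; the middle sentence is the one from the question.
article = (
    "Samples were collected in 2007. "
    "This effect (Smith et al., 2008) was seen in 0.0001% of samples (Figure 1b). "
    "No other effect was observed."
)

for sentence in sentences_with_figure(article):
    print(sentence)
```

With this input, only the middle sentence is printed, and it is printed in full, including "et al.," and "0.0001%". Note that the pattern treats any ., ?, or ! followed by whitespace and a capital letter as a sentence boundary, so an abbreviation followed by a capitalized word (for example "et al. Smith") would still cut the sentence short.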
