Regex to match whole sentence but with exceptions

Time:10-31

I use this regex to match whole sentences ending in a period, question mark, or exclamation mark. It works so far:

[A-Z][^.!?]*[.?!]

But there are two problems:

  1. if there is a number followed by a period in the sentence.
  2. if there is an abbreviation with a period in the sentence.

In both cases the sentence is extracted incorrectly. Examples:

  1. "Er kam am 1. November." (German: "He came on the 1st of November.")
  2. "Dr. Schiwago."

The first sentence is split into two because a period follows the number. The second sentence is also split into two because the abbreviation ends in a period.
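The splitting described above can be reproduced with a short Python sketch. The name NAIVE is only an illustrative label for the pattern from the question, and Python's re module is assumed as the regex engine.

```python
import re

# The pattern from the question: an uppercase letter, then any run of
# characters other than . ! ?, then the first . ? or ! that follows.
NAIVE = re.compile(r"[A-Z][^.!?]*[.?!]")

# The period after the number ends the first match too early.
print(NAIVE.findall("Er kam am 1. November."))
# ['Er kam am 1.', 'November.']

# The period after the abbreviation does the same.
print(NAIVE.findall("Dr. Schiwago."))
# ['Dr.', 'Schiwago.']
```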

How can I adjust the regex so that both problems do not occur?

In the first case, a period that follows a number should not be treated as the end of the sentence; the regex should continue to the next period.

In the second case, requiring a minimum length of, say, 4 characters would ensure that the abbreviation is not treated as a complete sentence.


CodePudding user response:

You can use

\b(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])*[.?!]

See the regex demo. Add the known abbreviations and other patterns you come across as alternatives to the non-capturing group.

  • \b - word boundary (Note: add (?=[A-Z]) after \b if you need to start matching with an uppercase letter)
  • (?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])* - zero or more occurrences of:
    • \d+\.\s*[A-Z] - one or more digits, ., zero or more whitespaces, uppercase letter
    • | - or
    • (?:[DJS]r|M(?:rs?|(?:is)?s))\. - Dr, Jr, Sr, Mr, Mrs, Ms or Miss, followed by a .
    • | - or
    • [^.!?] - a char other than ., ! and ?
  • [.?!] - a ., ? or ! char.
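As a rough check of the pattern explained above, here is a minimal Python sketch. It assumes the digit part is \d+ ("one or more digits", as the explanation says) and uses Python's re module. SENTENCE and build_sentence_pattern are hypothetical names introduced here; the helper only illustrates the advice to add further known abbreviations as alternatives.

```python
import re

# Pattern from the answer, assuming "\d+" for "one or more digits".
SENTENCE = re.compile(
    r"\b(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])*[.?!]"
)


def build_sentence_pattern(abbreviations):
    """Hypothetical helper: same shape, with a caller-supplied abbreviation list."""
    # Escape each abbreviation so it is matched literally.
    alternatives = "|".join(re.escape(a) for a in abbreviations)
    return re.compile(
        rf"\b(?:\d+\.\s*[A-Z]|(?:{alternatives})\.|[^.!?])*[.?!]"
    )


text = "Er kam am 1. November. Dr. Schiwago kam auch!"
print(SENTENCE.findall(text))
# ['Er kam am 1. November.', 'Dr. Schiwago kam auch!']

# Illustrative abbreviation list, not taken from the answer.
german = build_sentence_pattern(["Dr", "Prof", "Nr"])
print(german.findall("Prof. Dr. Schiwago wohnt in Nr. 5."))
# ['Prof. Dr. Schiwago wohnt in Nr. 5.']
```

One limitation to keep in mind: because \d+\.\s*[A-Z] treats any number, period and uppercase letter as sentence-internal, a sentence that really ends in a number is merged with the next one. For example, "Das war 2020. Er kam." comes back as a single match.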

CodePudding user response:

Try (regex101)

[A-Z].*?(?<!Dr)(?<!\d)\s*[.!?]

  • [A-Z] - start with capital letter

  • .*? - non-greedy match-all

  • (?<!Dr)(?<!\d)\s*[.!?] - match up to a ., ! or ? (with optional whitespace before it); the text immediately before that point must not be Dr or a digit.
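The lookbehind variant can be tried the same way. This is a small sketch using Python's re module; LAZY is just an illustrative name for the pattern above.

```python
import re

# Lazy match from an uppercase letter up to the first . ! or ? that is
# not directly preceded by "Dr" or by a digit.
LAZY = re.compile(r"[A-Z].*?(?<!Dr)(?<!\d)\s*[.!?]")

text = "Er kam am 1. November. Dr. Schiwago kam auch!"
print(LAZY.findall(text))
# ['Er kam am 1. November.', 'Dr. Schiwago kam auch!']

# Only "Dr" is excluded, so other abbreviations still split the sentence.
print(LAZY.findall("Mr. Smith left."))
# ['Mr.', 'Smith left.']
```

Each further abbreviation needs its own lookbehind, for example (?<!Mr). Python's re module only accepts fixed-width lookbehinds, so abbreviations of different lengths cannot be combined into a single one.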
