I use this regex to match whole sentences ending in a period, question mark, or exclamation mark. It works so far:
[A-Z][^.!?]*[.?!]
But there are two problems:
- a number followed by a period inside the sentence.
- an abbreviation with a period inside the sentence.
In both cases the sentence is extracted incorrectly. Examples:
- Sentence Example: "Er kam am 1. November."
- Sentence Example: "Dr. Schiwago."
The first sentence becomes two sentences because a period follows the number. The second sentence also becomes two sentences because the abbreviation ends in a period.
How can I adjust the regex so that neither problem occurs?
In the first sentence, a period that follows a number should not be treated as the end of the sentence; the regex should continue to the next period.
In the second sentence, a minimum length of, for example, 4 characters would ensure that the abbreviation is not treated as a complete sentence.
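To make the failure concrete, here is a minimal Python sketch that runs the pattern from the question on the two example sentences. Using Python's `re` module is an assumption on my part, since the question does not name a regex flavor, and the name `NAIVE` is purely illustrative.

```python
import re

# Pattern from the question: an uppercase letter, any run of
# non-terminator characters, then a single terminator.
NAIVE = re.compile(r"[A-Z][^.!?]*[.?!]")

for text in ["Er kam am 1. November.", "Dr. Schiwago."]:
    print(text, "->", NAIVE.findall(text))

# Er kam am 1. November. -> ['Er kam am 1.', 'November.']
# Dr. Schiwago. -> ['Dr.', 'Schiwago.']
```

Each sentence comes back as two matches because `[^.!?]*` stops at the first period, whether or not that period really ends the sentence.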
CodePudding user response:
You can use
\b(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])*[.?!]
See the regex demo. Add the known abbreviations and other patterns you come across as alternatives to the non-capturing group.
- \b - word boundary (note: add (?=[A-Z]) after \b if you need to start matching with an uppercase letter)
- (?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])* - zero or more occurrences of:
  - \d+\.\s*[A-Z] - one or more digits, a period, zero or more whitespace characters, and an uppercase letter
  - | - or
  - (?:[DJS]r|M(?:rs?|(?:is)?s))\. - Dr, Jr, Sr, Mr, Mrs, Ms, or Miss, followed by a period
  - | - or
  - [^.!?] - a char other than ., ! and ?
- [.?!] - a ., ? or ! char.
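As a rough check, the pattern can be run in Python against the two sentences from the question. This is only a sketch: it assumes Python's `re` module, writes the digit part as `\d+` to match the "one or more digits" description, and `split_sentences` is a hypothetical helper name, not something from the answer.

```python
import re

# Pattern from the answer. The alternatives are tried left to right:
# ordinal number + next capitalized word, known abbreviation + period,
# or any single char that is not a sentence terminator.
SENTENCE = re.compile(
    r"\b(?:\d+\.\s*[A-Z]|(?:[DJS]r|M(?:rs?|(?:is)?s))\.|[^.!?])*[.?!]"
)

def split_sentences(text):
    """Return every sentence-like match in text (hypothetical helper)."""
    return SENTENCE.findall(text)

print(split_sentences("Er kam am 1. November. Dr. Schiwago."))
# ['Er kam am 1. November.', 'Dr. Schiwago.']
```

Because the pattern contains only non-capturing groups, `findall` returns the full matches. Abbreviations outside the listed set are still treated as sentence ends until they are added as alternatives.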
CodePudding user response:
Try this (demo on regex101):
[A-Z].*?(?<!Dr)(?<!\d)\s*[.!?]
- [A-Z] - start with a capital letter
- .*? - non-greedy match-all
- (?<!Dr)(?<!\d)\s*[.!?] - match up to ., ! or ?, which must not be preceded by Dr or a digit
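A short Python sketch of the lookbehind variant, again assuming Python's `re` module (both lookbehinds are fixed-width, which `re` requires). The names `LOOKBEHIND` and `split_sentences_lookbehind` are illustrative only.

```python
import re

# Pattern from the answer: a terminator counts only if the text right
# before it is neither "Dr" nor a digit.
LOOKBEHIND = re.compile(r"[A-Z].*?(?<!Dr)(?<!\d)\s*[.!?]")

def split_sentences_lookbehind(text):
    """Return every sentence-like match in text (hypothetical helper)."""
    return LOOKBEHIND.findall(text)

print(split_sentences_lookbehind("Er kam am 1. November. Dr. Schiwago."))
# ['Er kam am 1. November.', 'Dr. Schiwago.']

# Only "Dr" is excluded, so other abbreviations still split:
print(split_sentences_lookbehind("Mr. Smith left."))
# ['Mr.', 'Smith left.']
```

Each further abbreviation needs its own `(?<!...)` lookbehind. Note also that a sentence that genuinely ends in a digit would not be terminated at that point, and that `.` does not match newlines unless `re.DOTALL` is set.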