Having difficulty matching entire sentence containing certain words, if the word of interest is the

Time:12-08

I am using the following regex pattern:

[^?!.\s][^?!.]*?\b([Cc]at|[Dd]og|[Bb]ird)\b[^?!.]*[.?!]

to match an entire sentence that contains any of the words listed in it, even if the sentence spans multiple lines.

However, I've found that if the word of interest is the first in the sentence, it will not match.

For example:

The bird is dead. (matches)
Dog days are over. (does not match)

The sentences I'm looking for are often grammatically incomplete, like the second one, but they still begin with a capital letter and end with a period.
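To reproduce the behaviour described above, here is a minimal sketch. Python's re module is an assumption on my part, since the question does not name a language, and the variable names are only illustrative. The likely cause is that the leading [^?!.\s] consumes the first character of the sentence, so a keyword that starts at that position can no longer be matched.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"[^?!.\s][^?!.]*?\b([Cc]at|[Dd]og|[Bb]ird)\b[^?!.]*[.?!]")

# Sample text built from the two sentences in the question.
text = "The bird is dead. Dog days are over."

# Collect every full sentence the pattern finds.
matches = [m.group(0) for m in ORIGINAL.finditer(text)]
print(matches)  # only the first sentence is found: ['The bird is dead.']
```

The second sentence is skipped because the "D" of "Dog" is taken by [^?!.\s] before the alternation is tried.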

CodePudding user response:

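You can use either of the following:

(?=\S)[^?!.]*?\b([Cc]at|[Dd]og|[Bb]ird)\b[^?!.]*[.?!]
\b[^?!.]*?\b([Cc]at|[Dd]og|[Bb]ird)\b[^?!.]*[.?!]

In the first regex, the first matched char must be a non-whitespace char, because (?=\S) is a positive lookahead that matches a location immediately followed by a non-whitespace char. Unlike the [^?!.\s] in your pattern, the lookahead does not consume that char, so the word of interest can be the first word in the sentence.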

The \b in the second variant is more specific: it matches a position between the start of the string or a non-word char and a word char, or between a word char and a non-word char or the end of the string.
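A quick way to check both variants is sketched below. It assumes Python's re module (the answer itself is language-neutral), uses the non-whitespace lookahead (?=\S) for the first variant, and the names WORDS, LOOKAHEAD and BOUNDARY are illustrative only.

```python
import re

# Shared keyword group, copied from the patterns above.
WORDS = r"([Cc]at|[Dd]og|[Bb]ird)"

# Variant 1: zero-width lookahead requiring a non-whitespace first char.
LOOKAHEAD = re.compile(r"(?=\S)[^?!.]*?\b" + WORDS + r"\b[^?!.]*[.?!]")
# Variant 2: start the match at a word boundary.
BOUNDARY = re.compile(r"\b[^?!.]*?\b" + WORDS + r"\b[^?!.]*[.?!]")

# Includes a keyword-first sentence that spans two lines.
text = "The bird is dead. Will Match. Dog days\nare over. No pets here."

lookahead_hits = [m.group(0) for m in LOOKAHEAD.finditer(text)]
boundary_hits = [m.group(0) for m in BOUNDARY.finditer(text)]

print(lookahead_hits)
print(boundary_hits)

# The variants differ when a sentence opens with punctuation.
quoted = '"Dog days."'
print(LOOKAHEAD.search(quoted).group(0))  # keeps the opening quote
print(BOUNDARY.search(quoted).group(0))   # starts at the word itself
```

On this sample both variants return the same two sentences, including the one that starts with "Dog" and continues on the next line. They differ only when a sentence begins with a non-word char such as a quote: the lookahead variant includes it, while the \b variant starts the match at the first word char.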

Note that in JavaScript the \b word boundary is not Unicode-aware. If you need full Unicode word boundary support, you will need a workaround; see "Replace certain arabic words in text string using Javascript".
