Regex that matches two or three words, but does not capture the third if it is a specific word

Time:06-24

I need to match a specific pattern, but I can't get it to work with regular expressions. I'm looking for people's names, which always follow the same patterns. Some combinations are:

  1. Mr. Snow
  2. Mr. John Snow
  3. Mr. John Snow (Winterfall of the nord lands)

My problem comes with input like: Mr. Snow and Ms. Stark. The regex also captures the "and". So I'm looking for a regular expression that skips the second word only if it is "and". Here I'm looking for ["Mr. Snow", "Ms. Stark"].

My best try is as follows:

(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?).

Note that the second name is in a non-capturing group, because I was thinking of using a negative lookahead. But if I do that, the first word is not captured (because the entire name does not match), and I need it to be captured.

Any ideas?

Here is some text for a quick check.

CodePudding user response:

Since it is a person's name, you could also check that each word starts with an uppercase letter.

M[rs].\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?

See the regex demo
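
A minimal sketch of trying this pattern in Python. The language is an assumption (the thread does not name one), the quantifiers are taken to be `+`, and `find_names` is a hypothetical helper name.

```python
import re

# Pattern from this answer, assuming the quantifiers are `+` (one or more).
# The dot after the title is left unescaped, as in the answer, so it matches any character.
CAPITALIZED_NAME = re.compile(r"M[rs].\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?")


def find_names(text: str) -> list[str]:
    """Return every full match; the pattern has no capturing groups."""
    return CAPITALIZED_NAME.findall(text)


print(find_names("Mr. Snow and Ms. Stark"))
print(find_names("Mr. John Snow (Winterfall of the nord lands)"))
```

The lowercase "and" fails the `[A-Z]` check, so the first call returns the two names separately. Note that the parenthesised part is only matched when two capitalized words come before it.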

CodePudding user response:

Matching names is difficult, see this page for a nice article:

Falsehoods Programmers Believe About Names.

For the examples that you have given, you might use:

\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*

Explanation

  • \b A word boundary
  • M[rs]\. Match either Mr or Ms followed by a literal dot (note that the dot is escaped)
  • (?: Non capture group
    • Match a space (or use \s if you want to allow newlines)
    • (?!M[rs]\.|and ) Negative lookahead, assert that from the current position there is not Mr. or Ms. or "and " directly to the right
    • \w+ Match 1 or more word characters
  • )* Close the non capture group and optionally repeat it

Regex demo
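
As a rough check of the lookahead approach, here is a short Python sketch. Python is an assumption (the thread does not name a language), the pattern is written with `\w+`, and `extract_names` is a hypothetical helper name.

```python
import re

# Title, then zero or more space-separated words. The negative lookahead
# stops the repetition before "and " or before another title.
NAME = re.compile(r"\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*")


def extract_names(text: str) -> list[str]:
    """Return all full matches of the name pattern."""
    return NAME.findall(text)


print(extract_names("Mr. Snow and Ms. Stark"))
print(extract_names("Mr. John Snow (Winterfall of the nord lands)"))
```

For the two sample inputs this gives `['Mr. Snow', 'Ms. Stark']` and `['Mr. John Snow']`. Any other word following the name (for example a verb) would still be consumed, since only "and" and titles are excluded.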

CodePudding user response:

Here are my two cents:

\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b

See an online demo


  • \b - A word-boundary;
  • M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
  • (\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group to match an uppercase letter followed by one or more lowercase letters, then zero or more further names joined by horizontal whitespace or a hyphen;
  • \b - A word-boundary.

CodePudding user response:

This captures the first name in group 1 and the second in group 2, if the second name exists and is not "and":

(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?

See live demo.


If you want to capture the title as group 1 and the names as groups 2 and 3, change the lookbehind to a capture group:

(M[rs]\.) (\w+)(?: (?!and)(\w+))?
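
To see what the three groups hold, here is a small Python sketch. Python is an assumption (no language is named in the thread), the pattern is written with `\w+`, and `split_names` is a hypothetical helper name.

```python
import re
from typing import Optional

# Group 1: title, group 2: first word, group 3: optional second word,
# which is skipped when the next word starts with "and".
TITLED_NAME = re.compile(r"(M[rs]\.) (\w+)(?: (?!and)(\w+))?")


def split_names(text: str) -> list[tuple[Optional[str], ...]]:
    """Return (title, first, second) per match; second is None when absent."""
    return [m.groups() for m in TITLED_NAME.finditer(text)]


print(split_names("Mr. Snow and Ms. Stark"))
print(split_names("Mr. John Snow"))
```

Because `(?!and)` has no trailing space or word boundary, it also rejects any second word that merely begins with lowercase "and"; appending `\b` to the lookahead would be one way to narrow that, if it matters for your data.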