I need to match a specific pattern, but I can't get it right with regular expressions. I'm looking for people's names, which always follow the same patterns. Some combinations are:
- Mr. Snow
- Mr. John Snow
- Mr. John Snow (Winterfall of the nord lands)
My problem comes when I sometimes have things like "Mr. Snow and Ms. Stark": the pattern also captures the "and". So I'm looking for a regular expression that skips the second word only if it is "and". For this input I'm looking for ["Mr. Snow", "Ms. Stark"].
My best attempt so far is:
(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?)
Note that the second name is in a non-capturing group. I was thinking of using a negative lookahead there, but if I do that, the first word is not captured either (because the entire name then fails to match), and I need it to be captured.
Any ideas?
Here is some text for a quick check.
CodePudding user response:
Since it is a person's name, you could also require that each word starts with an uppercase letter.
M[rs].\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?
See the regex demo
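A quick way to check this idea is a small Python script. This is only a sketch: it uses Python's built-in re module, the sample strings come from the question, and the one change to the pattern is that the dot after the title is escaped (\.) so it matches a literal dot only.

```python
import re

# Capitalised-word pattern from the answer above, with the dot escaped.
# All groups are non-capturing, so findall() returns the full matches.
NAME = re.compile(r"M[rs]\.\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^)]*\))?)?")

samples = [
    "Mr. Snow",
    "Mr. John Snow",
    "Mr. John Snow (Winterfall of the nord lands)",
    "Mr. Snow and Ms. Stark",
]

results = [NAME.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(f"{sample!r} -> {found}")
# The last sample yields ['Mr. Snow', 'Ms. Stark'] because "and" is lowercase.
```

Note that this relies on "and" being lowercase; a capitalised "And" would still be picked up as a second name.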
CodePudding user response:
Matching names is difficult; see the article "Falsehoods Programmers Believe About Names".
For the examples that you have given, you might use:
\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*
Explanation
- \b : a word boundary
- M[rs]\. : match either Mr or Ms followed by a dot (note that the dot has to be escaped)
- (?: : start of a non-capturing group
- (a space) : match a literal space (use \s instead if you want to allow newlines)
- (?!M[rs]\.|and ) : negative lookahead, assert that what is directly to the right of the current position is not Mr., Ms. or "and "
- \w+ : match 1 or more word characters
- )* : close the non-capturing group and optionally repeat it
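The lookahead approach explained above can be tried out with a few lines of Python. This is a minimal sketch using the built-in re module; the helper name find_names and the second and third sample sentences are made up for illustration.

```python
import re

# Title, then any number of space-separated words, where a word is rejected
# if it starts another title (Mr./Ms.) or is the word "and" followed by a space.
NAME = re.compile(r"\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*")


def find_names(text: str) -> list[str]:
    """Return every title-plus-name match found in text (hypothetical helper)."""
    return NAME.findall(text)


print(find_names("Mr. Snow and Ms. Stark"))            # ['Mr. Snow', 'Ms. Stark']
print(find_names("Mr. John Snow and Ms. Arya Stark"))  # ['Mr. John Snow', 'Ms. Arya Stark']
print(find_names("Mr. Snow Ms. Stark"))                # ['Mr. Snow', 'Ms. Stark']
```

Because the group repeats without an upper limit, this version also swallows any other word that follows the name, so it works best when names are delimited by "and", punctuation, or another title.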
CodePudding user response:
Here are my two cents:
\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b
See an online demo
\b - A word-boundary;
M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group, matching an uppercase letter followed by lowercase letters, and then zero or more further names joined by a horizontal whitespace or a hyphen;
\b - A word-boundary.
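Python's built-in re module supports neither \h nor \p{...}, so the pattern above cannot be pasted into it unchanged. The sketch below is an ASCII-only approximation, not the same pattern: it assumes [ \t] is an acceptable stand-in for \h, and [A-Z] and [a-z] for \p{Lu} and \p{Ll}. The sample with a hyphenated name is invented for illustration.

```python
import re

# ASCII-only approximation of the Unicode-aware pattern above:
#   \h     -> [ \t]
#   \p{Lu} -> [A-Z]
#   \p{Ll} -> [a-z]
NAME = re.compile(r"\bM[rs]\.[ \t]([A-Z][a-z]+(?:[ \t-][A-Z][a-z]+)*)\b")

text = "Mr. Snow and Ms. Mary-Ann Stark"

# Group 1 holds the name without the title.
names_only = NAME.findall(text)
# group(0) is the whole match, title included.
with_title = [m.group(0) for m in NAME.finditer(text)]

print(names_only)  # ['Snow', 'Mary-Ann Stark']
print(with_title)  # ['Mr. Snow', 'Ms. Mary-Ann Stark']
```

Names with accented or other non-ASCII letters will not match this approximation; an engine that supports Unicode properties is needed for the original pattern.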
CodePudding user response:
This captures the first name in group 1 and the second name in group 2, if the second name exists and is not "and":
(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?
See live demo.
If you want to capture the title as group 1 and the names as groups 2 and 3, change the look behind to a capture group:
(M[rs]\.) (\w+)(?: (?!and)(\w+))?
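Both variants can be checked in Python, whose built-in re module accepts this lookbehind because it has a fixed width. This is a sketch under that assumption; the variable names are made up, and note that re.findall reports a group that did not take part in the match as an empty string.

```python
import re

# Variant 1: the title is only asserted by a lookbehind, the names are groups 1 and 2.
NAMES_ONLY = re.compile(r"(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?")
# Variant 2: the title is captured as group 1, the names as groups 2 and 3.
WITH_TITLE = re.compile(r"(M[rs]\.) (\w+)(?: (?!and)(\w+))?")

text = "Mr. Snow and Ms. Stark"

pairs = NAMES_ONLY.findall(text)
triples = WITH_TITLE.findall(text)
full_name = WITH_TITLE.findall("Mr. John Snow")

print(pairs)      # [('Snow', ''), ('Stark', '')]
print(triples)    # [('Mr.', 'Snow', ''), ('Ms.', 'Stark', '')]
print(full_name)  # [('Mr.', 'John', 'Snow')]
```

One thing to keep in mind: (?!and) rejects any second word that merely starts with the lowercase letters "and"; adding a word boundary, as in (?!and\b), limits the exclusion to the word "and" itself.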