How to match names separated by "and" excluding "and" itself using regex?

I am trying to solve http://play.inginf.units.it/#/level/10

I have some strings as follows:

title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},

title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},

title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},

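I need to match each author name on its own, without the separating "and" (for example, Diaz, Navarro David and Gines, Rodriguez Noe should be two separate matches). I tried the following regex: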

(?<=author={).+(?=})

But it matches the entire string inside {}. I understand why that happens, but how can I split the match at each " and "?
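For reference, the behaviour can be reproduced with a short JavaScript sketch. JavaScript is an assumption here (the site runs in the browser), the variable names are made up for illustration, the quantifier is assumed to be `.+`, and the braces are escaped only as a precaution:

```javascript
// One author line copied from the samples above.
const line = 'author={Diaz, Navarro David and Gines, Rodriguez Noe},';

// The attempted pattern: everything between "author={" and a closing "}".
const whole = line.match(/(?<=author=\{).+(?=\})/g);

// A single match that still contains " and " between the two names.
console.log(whole);
// → [ 'Diaz, Navarro David and Gines, Rodriguez Noe' ]
```

The greedy `.+` has no reason to stop at " and ", so both names end up in one match.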

CodePudding user response：

It took me a little while to get the samples to show up in your link. What about:

(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+

See an online demo

(?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and literally 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match ' and ';
\K - Reset the starting point of the reported match;
(?:(?! and |},).)+ - Match 1+ characters, each of which must not be the start of ' and ' or of a '}' followed by a comma.

The above will also match 'others', as in the last sample of the linked test. If you wish to exclude 'others', add it to the negative lookahead:

(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+

See an online demo

In the comment section we established that the above does not work on the linked website. Apparently it is JS based, which does support lookbehind assertions, including ones of variable length. Therefore try:

(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*

See the demo

(?<= - Open lookbehind;
- \bauthor={ - Match word-boundary and literally 'author={';
- (?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
\b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
(?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching comma/space separated parts of a name.
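
As a rough check, the lookbehind pattern can be run in any JavaScript engine that supports lookbehind (Node.js is assumed here). This is only a sketch: the sample text is taken from the question, the variable names are hypothetical, and the braces are escaped as a precaution, which does not change what the pattern matches.

```javascript
// Two of the records from the question, one field per line.
const bibtex = [
  'title={AUTOMATIC ROCKING DEVICE},',
  'author={Diaz, Navarro David and Gines, Rodriguez Noe},',
  'year={2006},',
  'title={Fuel cells and their applications},',
  'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},',
  'volume={117},',
].join('\n');

// Lookbehind: "author={" must appear earlier on the same line.
// Main part: capitalised words, optionally joined by ", " or " ".
const namePattern =
  /(?<=\bauthor=\{(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*/g;

const names = bibtex.match(namePattern);
console.log(names);
// → [ 'Diaz, Navarro David', 'Gines, Rodriguez Noe',
//     'Kordesch, Karl', 'Simader, G{"u}nter', 'Wiley, John' ]
```

Because `.` does not cross line breaks, capitalised words on the title lines are never preceded by 'author={' on the same line and are skipped. The lowercase "and" stops each match since every name part has to start with A-Z.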

CodePudding user response：

If lookbehind assertions are supported and the names are made up of word characters, you might use:

(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b

Explanation

(?<= Positive lookbehind, assert that to the left of the current position is
- \bauthor={ Match author={ preceded by a word boundary
- [^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
) Close the lookbehind
[A-Z] Match an uppercase char A-Z
[^\s,]*, Optionally match non whitespace chars except , and then match ,
(?: Non capture group to repeat as a whole part
- \s+[A-Z][^\s,]* Match 1+ whitespace chars, an uppercase char A-Z, then optional non whitespace chars except ,
)+ Close the non capture group and repeat it 1 or more times
\b a word boundary

See a regex101 demo.
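
A minimal JavaScript sketch of this pattern, assuming an engine with lookbehind support such as Node.js. The records come from the question, the variable names are made up for illustration, and the braces are escaped as a precaution only:

```javascript
// Two of the records from the question, including the nested {"u}.
const records = [
  'title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},',
  'author={Standefer, Michael and Bay, Janet W and Trusso, Russell},',
  'journal={Neurosurgery},',
  'title={Fuel cells and their applications},',
  'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},',
  'volume={117},',
].join('\n');

// Lookbehind: "author={" followed only by non-brace chars or balanced {...} groups.
// Main part: "Surname," followed by 1+ whitespace-separated capitalised parts.
const nameAfterAuthor =
  /(?<=\bauthor=\{[^{}]*(?:\{[^{}]*\}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b/g;

const authors = records.match(nameAfterAuthor);
console.log(authors);
// → [ 'Standefer, Michael', 'Bay, Janet W', 'Trusso, Russell',
//     'Kordesch, Karl', 'Simader, G{"u}nter', 'Wiley, John' ]
```

The lookbehind cannot step over the unmatched `}` that closes the author field, so fields after it (journal, volume, the next title) produce no matches even though `[^{}]` can cross line breaks.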