How to match names separated by "and" excluding "and" itself using regex?

Time:01-14

I am trying to solve http://play.inginf.units.it/#/level/10

I have some strings as follows:

title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},

title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},

title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},

I need to match the names in bold. I tried the following regex:

(?<=author={).+(?=})

But it matches the entire string inside {}. I understand why that is, but how can I split the match on " and "?

CodePudding user response:

It took me a little while to get the samples to show up in your link. What about:

(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+

See an online demo


  • (?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and the literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match ' and ';
  • \K - Reset the starting point of the reported match;
  • (?:(?! and |},).)+ - Match any character, 1+ times, as long as neither ' and ' nor '},' starts at that position.

The above will also match 'others', as in the last sample of the linked test. If you wish to exclude 'others', you could add it to the negative lookahead:

(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+

See an online demo


In the comments we established that the above does not work on the linked website. Apparently it is JavaScript-based, so \G and \K are unavailable, but lookbehind assertions of variable length are supported. Therefore, try:

(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*

See the demo

  • (?<= - Open lookbehind;
    • \bauthor={ - Match word-boundary and literally 'author={';
    • (?:(?!\},).*?)) - Open a non-capture group holding a negative lookahead for '},' followed by 0+ (lazy) characters. Close the lookbehind;
  • \b[A-Z]\S*\b - Match anything between two word-boundaries that starts with a capital letter A-Z followed by 0+ non-whitespace chars;
  • (?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching the comma/space separated parts of a name.
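
As a rough check, the lookbehind pattern above can be run in a JavaScript engine that supports lookbehind (for example a recent Node.js). This is only a sketch: the names `bibSample`, `authorNamePattern` and `extractAuthorNames` are made up for illustration, the sample lines are copied from the question, and the `{` and `}` are escaped so the literal braces are unambiguous.

```javascript
// Sample lines copied from the question (illustrative input only).
const bibSample = [
  'title={AUTOMATIC ROCKING DEVICE},',
  'author={Diaz, Navarro David and Gines, Rodriguez Noe},',
  'year={2006},',
  'title={Fuel cells and their applications},',
  'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},',
  'volume={117},',
].join('\n');

// Same pattern as in the answer, with the literal braces escaped.
// The g flag makes String.prototype.match return every match.
const authorNamePattern =
  /(?<=\bauthor=\{(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*/g;

// Hypothetical helper: returns all matched names, or [] when none match.
function extractAuthorNames(text) {
  return text.match(authorNamePattern) || [];
}

console.log(extractAuthorNames(bibSample));
```

Because `.` does not match a line break, the lookbehind can only succeed on a line that contains `author={`, so the title, year and volume lines produce no matches.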

CodePudding user response:

If lookbehind assertions are supported and the names end in word characters, you might use:

(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b

Explanation

  • (?<= Positive lookbehind, assert that to the left of the current position is
    • \bauthor={ Match author={ preceded by a word boundary
    • [^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
  • ) Close the lookbehind
  • [A-Z] Match an uppercase char A-Z
  • [^\s,]*, Optionally match non whitespace chars except , and then match ,
  • (?: Non capture group to repeat as a whole part
    • \s+[A-Z][^\s,]* Match 1+ whitespace chars, an uppercase char A-Z, optional non whitespace chars except ,
  • )+ Close the non capture group and repeat it 1 or more times
  • \b A word boundary

See a regex101 demo.
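
A minimal sketch of how this brace-aware pattern could be tried out in JavaScript, assuming an engine with lookbehind support. The names `bibEntries`, `braceAwarePattern` and `matchAuthorNames` are invented for the example, the input lines come from the question, and the literal braces outside character classes are escaped.

```javascript
// Sample lines copied from the question (illustrative input only).
const bibEntries = [
  'title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},',
  'author={Standefer, Michael and Bay, Janet W and Trusso, Russell},',
  'journal={Neurosurgery},',
  'title={Fuel cells and their applications},',
  'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},',
  'volume={117},',
].join('\n');

// Same pattern as in the answer, with literal braces escaped.
// The lookbehind only allows non-brace chars or complete {...} groups
// after "author={", so it cannot step past the field's closing brace.
const braceAwarePattern =
  /(?<=\bauthor=\{[^{}]*(?:\{[^{}]*\}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b/g;

// Hypothetical helper: returns all matched names, or [] when none match.
function matchAuthorNames(text) {
  return text.match(braceAwarePattern) || [];
}

console.log(matchAuthorNames(bibEntries));
```

Note that this pattern requires a comma after the first word of each name, so an entry such as a bare 'others' would not be matched.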
