Using regex to match groups that may not contain given text patterns

Time:01-13

I'd like to use regex to extract birth dates and places, as well as (when they're defined) death dates and places, from a collection of encyclopedia entries. Here are some examples of such entries which illustrate the patterns that I'm trying to codify:

William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England—died April 23, 1850, Rydal Mount, Westmorland), English poet...

Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane Morris-Goodall, (born April 3, 1934, London, England), British ethologist...

Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, U.S.—died January 12, 1967, Austin, Texas), American psychologist...

I was hoping that the following regex pattern would identify the desired capture groups:

\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(?:—died )?(\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)

But sadly, it does not. (Use https://regexr.com/ to view the results.)

Note: When I restructure the pattern around the phrase —died in the following way, the pattern does produce the expected results for entries like the first and third above (those with given death dates/places), but it obviously does not work in all cases.

\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)

What am I missing?

CodePudding user response:

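In your first pattern, everything after the lazy birth-place group (.*?) is optional, so that group matches the empty string and the final (.*?) captures the rest of the parenthetical. In general, you can mark the whole —died section as optional instead: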

\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)

https://regexr.com/765fc
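
To try the optional-group pattern outside regexr, here is a minimal Python sketch. It assumes the month name is matched with \w+ and that the entries are plain strings; the names BORN_DIED, ENTRIES and extract are made up for this illustration. Group 3 is the whole optional —died section, so the death date and place are groups 4 and 5.

```python
import re

# Optional-group pattern from the answer above (month name assumed to be \w+).
# Groups: 1 birth date, 2 birth place, 3 whole "—died ..." part,
#         4 death date, 5 death place.
BORN_DIED = re.compile(
    r"\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)"
    r"(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)"
)

# Sample entries copied from the question.
ENTRIES = [
    "William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, "
    "England—died April 23, 1850, Rydal Mount, Westmorland), English poet...",
    "Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane "
    "Morris-Goodall, (born April 3, 1934, London, England), British ethologist...",
    "Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, "
    "U.S.—died January 12, 1967, Austin, Texas), American psychologist...",
]


def extract(entry):
    """Return birth/death details as a dict, or None if nothing matches."""
    m = BORN_DIED.search(entry)
    if m is None:
        return None
    return {
        "birth_date": m.group(1),
        "birth_place": m.group(2),
        # These two are None when the entry has no "—died" part.
        "death_date": m.group(4),
        "death_place": m.group(5),
    }


for entry in ENTRIES:
    print(extract(entry))
```

For the Jane Goodall entry the optional group does not participate in the match, so death_date and death_place come back as None rather than as empty strings.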

CodePudding user response:

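You can also use a negative lookahead, (?!pattern), to keep a group from containing a given text pattern, where "pattern" is the text you want to exclude. Be aware that a lookahead is only tested at the position where it appears, so "(?!a).*" does not exclude "a" from the match; it only requires that the match not start with "a". To match a run of characters that contains no "a", repeat the lookahead before every character: "(?:(?!a).)*" (for a single character, "[^a]*" is simpler). Applied to your entries, a birth-place group such as ((?:(?!—died|\)).)*) stops at —died or at the closing parenthesis, whichever comes first.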
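
How a negative lookahead behaves is easy to check by running it. The Python sketch below compares "(?!a).*" with the per-character form "(?:(?!a).)*", then uses the same per-character idea to capture a birth place that stops before —died or the closing parenthesis. The names naive, tempered, BIRTH and birth_place are illustrative only, and the month name is assumed to be matched with \w+.

```python
import re

# A lookahead is tested only at the position where it is written.
naive = re.compile(r"(?!a).*")         # only forbids "a" as the first character
tempered = re.compile(r"(?:(?!a).)*")  # tests the lookahead before every character

print(naive.fullmatch("banana"))     # a match object: "a" is still allowed later on
print(naive.fullmatch("apple"))      # None: the string starts with "a"
print(tempered.fullmatch("banana"))  # None: an "a" appears inside the string
print(tempered.fullmatch("bonobo"))  # a match object: no "a" anywhere

# Same idea for the encyclopedia entries: the birth place is every character
# that does not start "—died" and is not the closing parenthesis.
BIRTH = re.compile(r"\(born (\w+ \d{1,2}, \d{4}), ((?:(?!—died|\)).)*)")


def birth_place(entry):
    """Return the captured birth place, or None if the entry does not match."""
    m = BIRTH.search(entry)
    return m.group(2) if m else None


print(birth_place("(born April 3, 1934, London, England), British ethologist..."))
```

Because the excluded text here is several characters long (—died), a negated character class like [^a] cannot express it, which is where the per-character lookahead is useful.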
