I need to capture the year that follows the † character.
The pattern I found useful: †\D*(?<death_year>\d )*
(† 1656), (ur. 1520, † ok. 1585) - it works fine
but in this case:
(† 21 VII 1595)
I need to match only the last group of digits
I can rely on )
but those cases are possible: († 1656 ), († 1656? ), († 1656?)
Can you help me figure it out?
CodePudding user response:
You can grab the first 4-digit chunk after that char:
†.*?(?<death_year>\d{4})
See this regex demo. Here, † is matched first, then .*? consumes zero or more chars other than line break chars, as few as possible, and then (?<death_year>\d{4}) captures four digits.
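As a quick check, here is a minimal Python sketch of that pattern run against the inputs from the question. Note two assumptions: Python's re module spells named groups as (?P<name>...) rather than (?<name>...), and the helper name death_year_of and the sample list are made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
DEATH_YEAR = re.compile(r"†.*?(?P<death_year>\d{4})")

def death_year_of(text):
    """Return the first 4-digit chunk after †, or None if there is none."""
    match = DEATH_YEAR.search(text)
    return match.group("death_year") if match else None

# Sample inputs taken from the question.
samples = [
    "(† 1656)",
    "(ur. 1520, † ok. 1585)",
    "(† 21 VII 1595)",
    "(† 1656? )",
]

years = [death_year_of(s) for s in samples]
print(years)
```

Since the pattern starts at †, the birth year in "(ur. 1520, † ok. 1585)" is never considered, and the two-digit day in "(† 21 VII 1595)" is skipped because it is not four digits long.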
Although in this case it seems redundant, a check for "exactly four digits" might make it safer:
†.*?\b(?<death_year>\d{4})\b
†.*?(?<!\d)(?<death_year>\d{4})(?!\d)
†(?:.*?\D)??(?<death_year>\d{4})(?!\d)
See the regex demo. Here, \b fails the match if the four digits have a letter, digit or underscore on either side; (?<!\d) / (?!\d) do the same if there is a digit on either side; and the third option does the same as the second, except without a lookbehind: (?:.*?\D)?? lazily matches an optional sequence of zero or more non-linebreak chars, as few as possible, followed by a non-digit char, so that the four digits can still be captured right after †.
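The difference between the three stricter variants can be seen in a small Python sketch. It assumes Python's (?P<name>...) syntax for the named group; the dictionary keys, the helper name find_year and the test strings "(† 16567)" and "†1656r." are invented for illustration only.

```python
import re

# The three "exactly four digits" variants, in Python named-group syntax.
VARIANTS = {
    "word_boundary": re.compile(r"†.*?\b(?P<death_year>\d{4})\b"),
    "lookarounds": re.compile(r"†.*?(?<!\d)(?P<death_year>\d{4})(?!\d)"),
    "no_lookbehind": re.compile(r"†(?:.*?\D)??(?P<death_year>\d{4})(?!\d)"),
}

def find_year(name, text):
    """Run the named variant on text; return the captured year or None."""
    match = VARIANTS[name].search(text)
    return match.group("death_year") if match else None

def find_all(text):
    """Return the result of every variant, keyed by variant name."""
    return {name: find_year(name, text) for name in VARIANTS}

print(find_all("(† 21 VII 1595)"))  # ordinary input from the question
print(find_all("(† 16567)"))        # five digits in a row
print(find_all("†1656r."))          # year glued to † and followed by a letter
```

All three variants reject the five-digit run. Only the \b variant rejects "†1656r.", because a letter right after the digits means there is no word boundary there, while the digit-only lookarounds still accept it.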