I am working with historical text and I want to reformat it with RegEx. Problem is: There are lots of special characters (that is: letters) in the text that are not matched by RegEx character classes like [a-z] / [A-Z] or \w . For example I want to match the dot (and only the dot) in the following line:
<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>
Without the ÿ I could easily work with the mentioned character classes, like:
(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))
But it does not work with special characters that are not covered by ASCII. I tried lots of things but I can't make it work so the RegEx really only captures the dot in this very line. If I use more general Expressions like (.)* (instead of (\w|\s)* ) I get many more of the dots in the document (for example dots that are not between an opening and a closing tag but in between two such tagsets), which is not what I want. Any ideas for an expression that covers like all unicode letters?
CodePudding user response:
You may match any text between <
and >
with [^<>]*
:
(?<=(<tag1>[^<>]*))\.(?=([^<>]*</tag1>))
See the regex demo. Not sure you need all those capturing groups, you might get what you need without them:
(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)
See this regex demo. Details:
(?<=<tag1>[^<>]*)
- a location immediately preceded with<tag1
and then any zero or more chars other than<
and>
\.
- a dot(?=[^<>]*</tag1>)
- a location immediately preceded with any zero or more chars other than<
and>
and then</tag1>
.
CodePudding user response:
use a negated character class that exculdes the dot and the opening angle bracket:
(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.
with this kind of pattern it isn't even needed to check the closing tag. But if you absolutely want to check it, ends the pattern with:
(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)