RegEx in VSCode: capture every character/letter - not just ASCII

Time:12-20

I am working with historical text and I want to reformat it with RegEx. The problem is that the text contains many special characters (that is, letters) that are not matched by RegEx character classes like [a-z] / [A-Z] or \w . For example, I want to match the dot (and only the dot) in the following line:

<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>

Without the ÿ I could easily work with the mentioned character classes, like:

(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))

But it does not work with special characters that are not covered by ASCII. I tried lots of things, but I can't get the RegEx to capture only the dot in this very line. If I use more general expressions like (.)* (instead of (\w|\s)* ), I also get many other dots in the document (for example, dots that are not between an opening and a closing tag but in between two such tag sets), which is not what I want. Any ideas for an expression that covers all Unicode letters?
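The behaviour described above can be reproduced outside the editor. The following is a minimal sketch in TypeScript, assuming the editor's search behaves like the JavaScript regex engine. The helper `matchOffsets` and the second pattern are illustrative and not part of the original question. The second pattern swaps \w for \p{L} (Unicode letters) plus \p{M} (combining marks, in case the ÿ is stored as y plus a combining diaeresis), which needs the `u` flag in JavaScript.

```typescript
// Hypothetical helper: collect the start offset of every match.
// The pattern must carry the "g" flag, otherwise exec() never advances.
function matchOffsets(pattern: RegExp, text: string): number[] {
  const offsets: number[] = [];
  pattern.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    offsets.push(m.index);
  }
  return offsets;
}

// Sample line from the question.
const line: string = "<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>";

// The pattern from the question: \w only covers [A-Za-z0-9_], so the
// lookahead cannot get past the "ÿ" to reach </tag1>.
const asciiOnly: RegExp = new RegExp(
  String.raw`(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))`,
  "g",
);

// Same shape with Unicode property escapes instead of \w.
// \p{...} is only recognised when the "u" flag is set.
const unicodeLetters: RegExp = new RegExp(
  String.raw`(?<=<tag1>[\p{L}\p{M}\s]*)\.(?=[\p{L}\p{M}\s]*</tag1>)`,
  "gu",
);

const asciiHits: number[] = matchOffsets(asciiOnly, line);      // no matches
const unicodeHits: number[] = matchOffsets(unicodeLetters, line); // the one dot

console.log(asciiHits, unicodeHits);
```

Whether a given editor search accepts \p{L} depends on the engine it uses, so treat that part as something to try rather than a guarantee.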

CodePudding user response:

You may match any text that contains neither < nor > with [^<>]*:

(?<=(<tag1>[^<>]*))\.(?=([^<>]*</tag1>))

You probably don't need all those capturing groups; you can get the same result without them:

(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)

Details:

  • (?<=<tag1>[^<>]*) - a location immediately preceded by <tag1> and then any zero or more chars other than < and >
  • \. - a dot
  • (?=[^<>]*</tag1>) - a location immediately followed by any zero or more chars other than < and > and then </tag1>.
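The breakdown above can be checked with a small script. This is a minimal sketch in TypeScript, assuming JavaScript regex semantics. The sample text and the `[DOT]` marker are made up for illustration. Because [^<>] matches any character except the angle brackets, non-ASCII letters such as ÿ need no special handling.

```typescript
// The pattern from the answer, without the extra capturing groups.
const tagBoundedDot: RegExp = new RegExp(
  String.raw`(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)`,
  "g",
);

// Illustrative sample: one dot inside <tag1>, two dots between tag sets.
const sample: string = [
  "<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>",
  "A sentence between two tag sets. It must stay untouched.",
  "<tag1>No dot here</tag1>",
].join("\n");

// Replace every matched dot with a visible marker to see what was hit.
const marked: string = sample.replace(tagBoundedDot, "[DOT]");

console.log(marked);
// Only the dot after "Demosth" is replaced; the dots on the second line
// are left alone because no "<tag1>" directly precedes their run of text.
```

Note that this pattern stops at any < or >, so it will not match a dot that is separated from <tag1> or </tag1> by a nested tag.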

CodePudding user response:

Use a negated character class that excludes the dot and the opening angle bracket:

(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.

With this kind of pattern, it isn't even necessary to check the closing tag. But if you absolutely want to check it, end the pattern with:

(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)
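To see how the two variants differ, here is a minimal sketch in TypeScript, assuming JavaScript regex semantics. The sample string, the variable names, and the `[DOT]` marker are illustrative only. Because the lookbehind excludes dots, it picks the first dot after each <tag1>; it also steps over nested tags such as <i>, as long as they are not </tag1>.

```typescript
// The two pattern pieces from the answer.
const lookbehindPart: string = String.raw`(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.`;
const lookaheadPart: string = String.raw`(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)`;

// Variant 1: lookbehind only. Variant 2: also require a closing </tag1>.
const firstDotInTag1: RegExp = new RegExp(lookbehindPart, "g");
const firstDotInClosedTag1: RegExp = new RegExp(lookbehindPart + lookaheadPart, "g");

// Illustrative sample: a nested tag, a dot outside any tag1,
// and a <tag1> that is never closed.
const nested: string =
  "<tag1>See <i>Demosth</i>. Olynth</tag1> Outside. <tag1>Unclosed. text";

const loose: string = nested.replace(firstDotInTag1, "[DOT]");
const strict: string = nested.replace(firstDotInClosedTag1, "[DOT]");

console.log(loose);  // marks the dot after </i> and the dot after "Unclosed"
console.log(strict); // marks only the dot after </i>
```

On well-formed input both variants hit the same dots; the lookahead only makes a difference when a <tag1> is left unclosed.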