Identify and replace non-ASCII characters between brackets

Time:12-24

I have tags of the following structure, with only ASCII characters inside the brackets: [Root.GetSomething]. However, some contributors submitted contributions containing Cyrillic characters that look like Latin ones, e.g. [Rооt.GеtSоmеthіng]. I need to locate those characters inside the brackets and replace them with the matching ASCII characters.

I tried \[([АаІіВСсЕеРТтОоКкХхМ]+)\] and (\[)([^\x00-\x7F]+)(\]), plus some variations of the range, but those searches don't find any matches. I seem to be missing something important in the regex execution logic.

CodePudding user response:

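You can use a regex that matches any of the look-alike Cyrillic characters located between [ and ], where the rest of the bracketed text consists only of letters or . characters, together with a conditional replacement pattern: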

Find What: (?:\G(?!\A)|\[)[a-zA-Z.]*\K(?:(А)|(а)|(І)|(і)|(В)|(С)|(с)|(Е)|(е)|(Р)|(Т)|(т)|(О)|(о)|(К)|(к)|(Х)|(х)|(М))(?=[[:alpha:].]*])
Replace With: (?1A:?2a:?3I:?4i:?5B:?6C:?7c:?8E:?9e:?{10}P:?{11}T:?{12}t:?{13}O:?{14}o:?{15}K:?{16}k:?{17}X:?{18}x:?{19}M)

Make sure the Match Case option is ON.

Details:

  • (?:\G(?!\A)|\[) - end of the previous successful match or a [ char
  • [a-zA-Z.]* - zero or more . or ASCII letters
  • \K - match reset operator that discards the currently matched text from the overall match memory buffer
  • (?:(А)|(а)|(І)|(і)|(В)|(С)|(с)|(Е)|(е)|(Р)|(Т)|(т)|(О)|(о)|(К)|(к)|(Х)|(х)|(М)) - a non-capturing group containing 19 alternatives each of which is put into a separate capturing group
  • (?=[[:alpha:].]*]) - a positive lookahead that requires zero or more letters or . and then a ] char immediately to the right of the current location.
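
For readers who want to do the locating step outside the editor, here is a minimal Python sketch. It is an approximation, not a port of the pattern above: Python's standard `re` module has no `\G` or `\K`, so the sketch first matches a whole bracketed tag made of letters and dots, then looks for the listed Cyrillic characters inside it. The names `find_suspect_tags`, `TAG`, `LOOKALIKE` and `ALL_NON_ASCII` are made up for illustration. It also shows one likely reason the second pattern from the question finds nothing: it requires every character between the brackets to be non-ASCII, while the tags are mixed.

```python
import re

# The 19 Cyrillic look-alikes from the answer, written as escapes so they
# cannot be confused with the Latin letters they resemble.
SUSPECTS = (
    "\u0410\u0430\u0406\u0456\u0412\u0421\u0441\u0415\u0435\u0420"
    "\u0422\u0442\u041e\u043e\u041a\u043a\u0425\u0445\u041c"
)

# A bracketed tag whose content is only letters (any script) or dots.
TAG = re.compile(r"\[(?:[^\W\d_]|\.)*\]")
LOOKALIKE = re.compile("[" + SUSPECTS + "]")

# Modelled on the question's second attempt: the whole bracket content
# must be non-ASCII for this to match.
ALL_NON_ASCII = re.compile(r"\[[^\x00-\x7F]+\]")


def find_suspect_tags(text):
    """Return (offset, tag, lookalikes) for each tag with a listed Cyrillic letter."""
    hits = []
    for m in TAG.finditer(text):
        found = LOOKALIKE.findall(m.group(0))
        if found:
            hits.append((m.start(), m.group(0), found))
    return hits


if __name__ == "__main__":
    sample = "[Root.GetSomething] and [R\u043e\u043et.Get]"
    for offset, tag, chars in find_suspect_tags(sample):
        print(offset, tag, [hex(ord(c)) for c in chars])
```

Printing the code points rather than the characters themselves makes the hits easy to tell apart from their Latin twins.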

The (?1A:?2a:?3I:?4i:?5B:?6C:?7c:?8E:?9e:?{10}P:?{11}T:?{12}t:?{13}O:?{14}o:?{15}K:?{16}k:?{17}X:?{18}x:?{19}M) replacement pattern replaces А (\u0410) with A if Group 1 matched, а (\u0430) with a if Group 2 matched, etc.
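
The same character-by-character substitution can be sketched in Python, with the caveat that this swaps in a different technique: a translation table plus a `re.sub` callback instead of the conditional replacement syntax, which Python's `re` module does not support. The function name `fix_tags` is hypothetical, and the sketch assumes a tag contains only letters and dots, which is slightly looser than the Find What pattern above.

```python
import re

# Cyrillic look-alikes (as escapes) and the ASCII letters that replace them,
# in the same order as the 19 capturing groups in the answer.
CYRILLIC = (
    "\u0410\u0430\u0406\u0456\u0412\u0421\u0441\u0415\u0435\u0420"
    "\u0422\u0442\u041e\u043e\u041a\u043a\u0425\u0445\u041c"
)
LATIN = "AaIiBCcEePTtOoKkXxM"
TO_LATIN = str.maketrans(CYRILLIC, LATIN)

# A bracketed tag whose content is only letters (any script) or dots.
TAG = re.compile(r"\[(?:[^\W\d_]|\.)*\]")


def fix_tags(text):
    """Replace listed Cyrillic letters with ASCII ones, but only inside tags."""
    return TAG.sub(lambda m: m.group(0).translate(TO_LATIN), text)


if __name__ == "__main__":
    broken = "See [R\u043e\u043et.G\u0435tS\u043em\u0435th\u0456ng] here"
    print(fix_tags(broken))
```

Text outside the brackets is left alone, so legitimate Cyrillic prose elsewhere in the file is not affected.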
