I have tags (only ASCII chars inside brackets) of the following structure: [Root.GetSomething]. However, some contributors submitted contributions with Cyrillic chars that look similar to Latin ones, e.g. [Rооt.GеtSоmеthіng].
I need to locate those inconsistencies and then replace them with the matching ASCII characters inside the brackets.
I tried \[([АаІіВСсЕеРТтОоКкХхМ] )\]; (\[)([^\x00-\x7F] )(\]), and some variations of the range, but those searches don't find any matches. I seem to be missing something important in the regex execution logic.
CodePudding user response:
You can use a regex that matches any "interesting" Cyrillic char between [ and ], where only letters or . may surround it, together with a conditional replacement pattern:
Find What: (?:\G(?!\A)|\[)[a-zA-Z.]*\K(?:(А)|(а)|(І)|(і)|(В)|(С)|(с)|(Е)|(е)|(Р)|(Т)|(т)|(О)|(о)|(К)|(к)|(Х)|(х)|(М))(?=[[:alpha:].]*])
Replace With: (?1A:?2a:?3I:?4i:?5B:?6C:?7c:?8E:?9e:?{10}P:?{11}T:?{12}t:?{13}O:?{14}o:?{15}K:?{16}k:?{17}X:?{18}x:?{19}M)
Make sure the Match Case option is ON.
Details:
(?:\G(?!\A)|\[) - end of the previous successful match or a [ char
[a-zA-Z.]* - zero or more . or ASCII letters
\K - match reset operator that discards the currently matched text from the overall match memory buffer
(?:(А)|(а)|(І)|(і)|(В)|(С)|(с)|(Е)|(е)|(Р)|(Т)|(т)|(О)|(о)|(К)|(к)|(Х)|(х)|(М)) - a non-capturing group containing 19 alternatives, each of which is put into a separate capturing group
(?=[[:alpha:].]*]) - a positive lookahead that requires zero or more letters or . and then a ] char immediately to the right of the current location.
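For the "locate" half of the task outside the editor, the same scoping idea (only look at chars inside a [...] tag) can be sketched in Python. This is only an illustration under assumptions: Python's standard re module has neither \G nor \K, so the sketch matches a whole tag first and then scans it, and a tag is assumed to contain only ASCII letters, the listed lookalikes, and dots. The names LOOKALIKES, TAG_RE, LOOKALIKE_RE and find_lookalikes are hypothetical.

```python
import re

# The 19 Cyrillic lookalikes from the pattern above, written as escapes so
# they cannot be confused with their Latin twins when reading the code.
LOOKALIKES = (
    "\u0410\u0430\u0406\u0456\u0412\u0421\u0441\u0415\u0435\u0420"
    "\u0422\u0442\u041e\u043e\u041a\u043a\u0425\u0445\u041c"
)

# Assumption: a tag is "[" + (ASCII letters, lookalikes or ".")* + "]".
TAG_RE = re.compile(rf"\[[A-Za-z.{LOOKALIKES}]*\]")
LOOKALIKE_RE = re.compile(f"[{LOOKALIKES}]")


def find_lookalikes(text):
    """Return (offset, char) pairs for each lookalike found inside a tag."""
    hits = []
    for tag in TAG_RE.finditer(text):
        for m in LOOKALIKE_RE.finditer(tag.group()):
            # Translate the offset within the tag back to an offset in text.
            hits.append((tag.start() + m.start(), m.group()))
    return hits
```

Lookalike chars outside brackets are deliberately ignored, which mirrors what the \G-anchored pattern and its lookahead achieve in a single expression.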
The (?1A:?2a:?3I:?4i:?5B:?6C:?7c:?8E:?9e:?{10}P:?{11}T:?{12}t:?{13}O:?{14}o:?{15}K:?{16}k:?{17}X:?{18}x:?{19}M) replacement pattern replaces А (\u0410) with A if Group 1 matched, а (\u0430) with a if Group 2 matched, etc.
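If the files need to be fixed from a script rather than from the Find/Replace dialog, the group-by-group mapping can be approximated with a translation table. The following is a minimal Python sketch, not the answer's method: it assumes a tag holds only letters and dots, and the names CYRILLIC, LATIN, TABLE, TAG_RE and fix_tags are hypothetical.

```python
import re

# Cyrillic lookalikes and their ASCII counterparts, in the same order as
# Groups 1..19 of the pattern. Escapes are used to keep the two sets apart.
CYRILLIC = (
    "\u0410\u0430\u0406\u0456\u0412\u0421\u0441\u0415\u0435\u0420"
    "\u0422\u0442\u041e\u043e\u041a\u043a\u0425\u0445\u041c"
)
LATIN = "AaIiBCcEePTtOoKkXxM"
TABLE = str.maketrans(CYRILLIC, LATIN)

# Assumption: a tag is "[" + (any Unicode letter or ".")* + "]".
# [^\W\d_] is the usual re idiom for "a letter" (word char, not digit or _).
TAG_RE = re.compile(r"\[(?:[^\W\d_]|\.)*\]")


def fix_tags(text):
    """Replace Cyrillic lookalikes with ASCII letters, inside tags only."""
    return TAG_RE.sub(lambda m: m.group().translate(TABLE), text)


# The sample tag from the question, with its Cyrillic chars as escapes.
sample = "[R\u043e\u043et.G\u0435tS\u043em\u0435th\u0456ng]"
print(fix_tags(sample))  # [Root.GetSomething]
```

Text outside the brackets is left untouched, and so are Cyrillic letters that are not in the list of 19.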