I want to convert JSON Unicode escape sequences of the form \uA90F to XML character references of the form &#xA90F;.
The Replace in the script below is working, but I cannot figure out how to substitute the result back into the original string, so that this:
Zo\u00EB C\u00E9cile Fran\u00E7oise
becomes:
Zo&#x00EB; C&#x00E9;cile Fran&#x00E7;oise
Dim RegX, InputText, SearchPattern, ReplacedText, Matches, Match, s
'Zoe Francoise Cecile
InputText = "Zo\u00E2 Fran\u00E7oise C\u00E9cile"
SearchPattern = "\\u[a-zA-Z0-9]{4}"
Set RegX = New RegExp
RegX.Pattern = SearchPattern
RegX.Global = True
Set Matches = RegX.Execute(InputText)
For Each Match In Matches
    s = Replace(Match, "\u", "&#x") & ";" '<-- ** This works fine **
    MsgBox(s)
Next
CodePudding user response:
You need to match hex chars with [A-Fa-f0-9], not [a-zA-Z0-9].
Also, you need to use a capturing group around the part of the pattern you need to keep in the result, and a backreference to it in the RegExp.Replace method (you needn't first collect the matches).
So, you can use:
Dim RegX, InputText, ReplacedText
InputText = "Zo\u00E2 Fran\u00E7oise C\u00E9cile"
Set RegX = New RegExp
RegX.Pattern = "\\u([a-fA-F0-9]{4})"
RegX.Global = True
ReplacedText = RegX.Replace(InputText, "&#x$1;")
MsgBox(ReplacedText)
See the regex demo.
CodePudding user response:
Note: as @MichaelKay mentioned in comments, what you are attempting to do will only work for Unicode characters <= U+FFFF. For higher-valued codepoints, the code will get a bit more complicated.
Each Match has three properties:
Value - the actual text in the search string that was matched.
FirstIndex - the 0-based index of the first character of the match in the search string.
Length - the length of the matched string.
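As a quick illustration of those three properties, here is a minimal sketch that collects them for every match in a short sample string. The sample input and the Report variable are made up for this example only.

```vbscript
Option Explicit
Dim RegX, Matches, Match, Report

Set RegX = New RegExp
RegX.Pattern = "\\u([a-fA-F0-9]{4})"
RegX.Global = True

Set Matches = RegX.Execute("Zo\u00EB C\u00E9cile")

Report = ""
For Each Match In Matches
    ' FirstIndex is 0-based, while VBScript string functions such as Mid() are 1-based
    Report = Report & Match.Value & " at index " & Match.FirstIndex & _
             ", length " & Match.Length & vbCrLf
Next

' Shows:
'   \u00EB at index 2, length 6
'   \u00E9 at index 10, length 6
MsgBox Report
```

Because FirstIndex counts from 0, you have to add 1 before using it with Mid(), Left() and friends.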
So, you know exactly where each Match was found in the InputText string. You can use that information to create a new string with the replaced substrings in it, e.g.:
Dim RegX, InputText, SearchPattern, ReplacedText, S, StartIdx, FoundIdx, Matches, Match
'Zoe Francoise Cecile
InputText = "Zo\u00E2 Fran\u00E7oise C\u00E9cile"
SearchPattern = "\\u([a-fA-F0-9]{4})"
Set RegX = New RegExp
RegX.Pattern = SearchPattern
RegX.Global = True
Set Matches = RegX.Execute(InputText)
StartIdx = 1 ' 1-based position of the next character not yet copied
ReplacedText = ""
For Each Match In Matches
    S = Replace(Match.Value, "\u", "&#x") & ";"
    MsgBox(S)
    ' FirstIndex is 0-based, Mid() is 1-based
    FoundIdx = Match.FirstIndex + 1
    ' Copy the unmatched text that precedes this match
    If FoundIdx > StartIdx Then
        ReplacedText = ReplacedText & Mid(InputText, StartIdx, FoundIdx - StartIdx)
    End If
    ReplacedText = ReplacedText & S
    StartIdx = FoundIdx + Match.Length
Next
' Copy whatever is left after the last match
If StartIdx <= Len(InputText) Then
    ReplacedText = ReplacedText & Mid(InputText, StartIdx)
End If
Alternatively, you could just use the RegExp.Replace() method, as @WiktorStribiżew's answer demonstrates.
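For the codepoints above U+FFFF mentioned in the note at the top of this answer, JSON writes the character as two consecutive escapes, a high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF). Here is a rough sketch of one way to handle that, not a definitive implementation: a first pass combines each surrogate pair into a single character reference using the Match properties, and a second pass handles the remaining escapes with RegExp.Replace(). The function names HexToLong and JsonEscapesToXmlRefs are made up for this example. Decimal constants and a hand-written hex parser are used to sidestep any sign surprises with 16-bit hex values. A lone, unpaired surrogate is not detected and would pass through as an invalid reference.

```vbscript
Option Explicit

' Parse a string of hex digits one digit at a time (hypothetical helper)
Function HexToLong(HexText)
    Dim i, Result
    Result = 0
    For i = 1 To Len(HexText)
        ' InStr gives the 1-based position of the digit, so subtract 1 for its value
        Result = Result * 16 + InStr("0123456789ABCDEF", UCase(Mid(HexText, i, 1))) - 1
    Next
    HexToLong = Result
End Function

Function JsonEscapesToXmlRefs(InputText)
    Dim RegX, Matches, Match, Result, StartIdx, FoundIdx, CodePoint

    Set RegX = New RegExp
    RegX.Global = True

    ' Pass 1: a high surrogate immediately followed by a low surrogate
    RegX.Pattern = "\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
    Set Matches = RegX.Execute(InputText)

    Result = ""
    StartIdx = 1
    For Each Match In Matches
        FoundIdx = Match.FirstIndex + 1
        Result = Result & Mid(InputText, StartIdx, FoundIdx - StartIdx)
        ' 65536 = 0x10000, 55296 = 0xD800, 1024 = 0x400, 56320 = 0xDC00
        CodePoint = 65536 + (HexToLong(Match.SubMatches(0)) - 55296) * 1024 _
                          + (HexToLong(Match.SubMatches(1)) - 56320)
        Result = Result & "&#x" & Hex(CodePoint) & ";"
        StartIdx = FoundIdx + Match.Length
    Next
    Result = Result & Mid(InputText, StartIdx)

    ' Pass 2: the remaining escapes, all of which are <= U+FFFF
    RegX.Pattern = "\\u([a-fA-F0-9]{4})"
    JsonEscapesToXmlRefs = RegX.Replace(Result, "&#x$1;")
End Function

' Shows: Zo&#x00EB; &#x1F600;
MsgBox JsonEscapesToXmlRefs("Zo\u00EB \uD83D\uDE00")
```

The pairs have to be processed first. If the simple pattern ran first, each half of a pair would be turned into its own reference, and surrogate values are not valid characters in XML.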