Convert JSON Unicode escape sequences of the form '\uA90F' to XML character references us

Time:12-10

I want to convert JSON Unicode escape sequences of the form \uA90F to XML character references of the form &#xA90F;.

The Replace in the script below is working, but I cannot figure out how to substitute back into the original string, so that this:

Zo\u00EB C\u00E9cile Fran\u00E7oise

Becomes:

Zoë Cécile Françoise

Dim RegX, InputText, SearchPattern, ReplacedText, Matches, Match, s

            'Zoe        Francoise       Cecile
InputText = "Zo\u00E2   Fran\u00E7oise  C\u00E9cile"

SearchPattern = "\\u[a-zA-Z0-9]{4}"

set RegX = New RegExp
RegX.Pattern = SearchPattern
RegX.Global = True

Set Matches = RegX.Execute(InputText)
For Each Match In Matches
    s = Replace(Match, "\u", "&#x") & ";" '<-- ** This works fine **
    MsgBox(s)
Next

CodePudding user response:

You need to match hex chars with [A-Fa-f0-9], not [a-zA-Z0-9].
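To see why the character class matters, here is a small sketch. The helper name MatchesEscape is made up for illustration; it only wraps the standard RegExp.Test method. The loose class accepts a string that is not a valid escape at all, while the hex-only class rejects it.

```vbscript
' Hypothetical helper (not from the question): reports whether
' Text contains something the given pattern treats as an escape.
Function MatchesEscape(SearchPattern, Text)
    Dim RegX
    Set RegX = New RegExp
    RegX.Pattern = SearchPattern
    MatchesEscape = RegX.Test(Text)
End Function

Dim LooseResult, StrictResult

' "\uZZZZ" is not a valid JSON escape: Z is not a hex digit.
LooseResult = MatchesEscape("\\u[a-zA-Z0-9]{4}", "\uZZZZ")   ' True  (false positive)
StrictResult = MatchesEscape("\\u[A-Fa-f0-9]{4}", "\uZZZZ")  ' False
```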

Also, you need to use a capturing group around the pattern you need to keep in the result, and a backreference in the RegExp.Replace method (you needn't first collect the matches).

So, you can use

Dim RegX, InputText, ReplacedText

InputText = "Zo\u00E2   Fran\u00E7oise  C\u00E9cile"

Set RegX = New RegExp
RegX.Pattern = "\\u([a-fA-F0-9]{4})"
RegX.Global = True

ReplacedText = RegX.Replace(InputText, "&#x$1;")
  
MsgBox(ReplacedText)


CodePudding user response:

Note: as @MichaelKay mentioned in comments, what you are attempting to do will only work for Unicode characters <= U+FFFF. For higher-valued code points, the code will get a bit more complicated.
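To give an idea of what "more complicated" could look like: JSON writes code points above U+FFFF as a UTF-16 surrogate pair (two consecutive \u escapes), and an XML character reference needs the single combined code point. The following is only a sketch under that assumption. The function names JsonEscapesToXmlRefs and HexToLong are made up for illustration, and unpaired surrogates are not handled specially; they are passed through like any other escape.

```vbscript
' Hypothetical helper: parses a hex string digit by digit, which avoids
' the signed 16-bit interpretation of values such as "&HD83D".
Function HexToLong(HexText)
    Dim i, Value
    Value = 0
    For i = 1 To Len(HexText)
        Value = Value * 16 + InStr("0123456789ABCDEF", UCase(Mid(HexText, i, 1))) - 1
    Next
    HexToLong = Value
End Function

' Hypothetical helper: converts \uXXXX escapes to XML character references,
' merging a high+low surrogate pair into one code point.
Function JsonEscapesToXmlRefs(InputText)
    Dim RegX, Matches, Match, Result, StartIdx, FoundIdx, Hi, Lo, CodePoint

    Set RegX = New RegExp
    ' First alternative: surrogate pair. Second alternative: any single escape.
    RegX.Pattern = "\\u(D[89AB][0-9A-F]{2})\\u(D[C-F][0-9A-F]{2})|\\u([0-9A-F]{4})"
    RegX.IgnoreCase = True
    RegX.Global = True

    Set Matches = RegX.Execute(InputText)
    Result = ""
    StartIdx = 1

    For Each Match In Matches
        FoundIdx = Match.FirstIndex + 1
        ' Copy the untouched text between the previous match and this one.
        Result = Result & Mid(InputText, StartIdx, FoundIdx - StartIdx)
        If Len(Match.SubMatches(0)) > 0 Then
            Hi = HexToLong(Match.SubMatches(0))
            Lo = HexToLong(Match.SubMatches(1))
            ' 65536 = &H10000, 55296 = &HD800, 56320 = &HDC00, 1024 = &H400
            CodePoint = 65536 + (Hi - 55296) * 1024 + (Lo - 56320)
            Result = Result & "&#x" & Hex(CodePoint) & ";"
        Else
            Result = Result & "&#x" & Match.SubMatches(2) & ";"
        End If
        StartIdx = FoundIdx + Match.Length
    Next

    ' Append whatever follows the last match.
    JsonEscapesToXmlRefs = Result & Mid(InputText, StartIdx)
End Function
```

With this sketch, JsonEscapesToXmlRefs("Zo\u00EB \uD83D\uDE00") should give Zo&#x00EB; &#x1F600;.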

Each Match has three properties:

  • Value - the actual text in the search string that was matched.

  • FirstIndex - the 0-based index of the first character of the match in the search string.

  • Length - the length of the matched string.
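
As a quick illustration of these three properties, the sketch below lists them for every match in a string. DescribeMatches is a made-up helper name; the RegExp members it uses are the ones listed above.

```vbscript
' Hypothetical helper: returns one line per match with its
' Value, FirstIndex (0-based) and Length.
Function DescribeMatches(InputText)
    Dim RegX, Matches, Match, Report

    Set RegX = New RegExp
    RegX.Pattern = "\\u([a-fA-F0-9]{4})"
    RegX.Global = True

    Set Matches = RegX.Execute(InputText)
    Report = ""
    For Each Match In Matches
        Report = Report & Match.Value & " at " & Match.FirstIndex & _
                 ", length " & Match.Length & vbCrLf
    Next

    DescribeMatches = Report
End Function
```

For the input "Zo\u00EB C\u00E9cile" this reports \u00EB at 2, length 6 and \u00E9 at 10, length 6. Passing the result to MsgBox displays it.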

So, you know exactly where each Match was found in the InputText string. You can use that information to create a new string with the replaced substrings in it, eg:

Dim RegX, InputText, SearchPattern, ReplacedText, S, StartIdx, FoundIdx

            'Zoe        Francoise       Cecile
InputText = "Zo\u00E2   Fran\u00E7oise  C\u00E9cile"

SearchPattern = "\\u([a-fA-F0-9]{4})"

set RegX = New RegExp
RegX.Pattern = SearchPattern
RegX.Global = True

Set Matches = RegX.Execute(InputText)

StartIdx = 1
ReplacedText = ""

For Each Match In Matches
    S = Replace(Match.Value, "\u", "&#x") & ";"
    MsgBox(S)
    FoundIdx = Match.FirstIndex + 1
    If FoundIdx > StartIdx Then
        ReplacedText = ReplacedText & Mid(InputText, StartIdx, FoundIdx - StartIdx)
    End If
    ReplacedText = ReplacedText & S
    StartIdx = FoundIdx + Match.Length
Next

If StartIdx <= Len(InputText) Then
    ReplacedText = ReplacedText & Mid(InputText, StartIdx)
End If

Alternatively, you could just use the RegExp.Replace() method, like @WiktorStribiżew's answer demonstrates.
