.NET 6 System.Text.Json.JsonSerializer Deserialize UTF-8 escaped string

I have some JSON files (from a Facebook backup) that are UTF-8 encoded, but special characters are escaped. The escaped characters are also UTF-8 encoded, but each byte is written as a hexadecimal \u escape. For example:

{
  "sender_name": "Tam\u00c3\u00a1s"
}

I want to use System.Text.Json.JsonSerializer for deserialization. The problem is that it interprets the escaped hex values as UTF-16 code units, so the value is deserialized as "TamÃ¡s" rather than "Tamás" as it should be.
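
For illustration, the mix-up can be reproduced by hand. The sketch below is only a demonstration (the class and method names are made up here): it encodes the text as UTF-8 and then widens every byte to a char, which is effectively what happens when the byte-wise escapes are read as UTF-16 code units. The second helper skips JSON escaping of quotes and backslashes, so it is not a real serializer.

```csharp
using System;
using System.Linq;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: encodes the text as UTF-8, then treats every byte
    // as if it were a UTF-16 code unit.
    public static string MisreadUtf8AsUtf16(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return new string(utf8.Select(b => (char)b).ToArray());
    }

    // Hypothetical helper: writes every non-ASCII UTF-8 byte as its own \u00XX escape,
    // which is the shape seen in the export file.
    public static string ToUtf8ByteEscapes(string text)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
            sb.Append(b < 0x80 ? ((char)b).ToString() : $"\\u{b:x4}");
        return sb.ToString();
    }
}
```

MojibakeDemo.MisreadUtf8AsUtf16("Tamás") gives "TamÃ¡s", and MojibakeDemo.ToUtf8ByteEscapes("Tamás") gives Tam\u00c3\u00a1s, matching the file content above.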

Code to repro:

using System;
using System.Text.Json;
using System.Text.Json.Serialization;

class Msg
{
    [JsonPropertyName("sender_name")]
    public string SenderName { get; set; }
}   

public class Program
{
    public static void Main()
    {
        var data = @"{
            ""sender_name"": ""Tam\u00c3\u00a1s""
        }";
        var msg = JsonSerializer.Deserialize<Msg>(data);
        Console.WriteLine(msg.SenderName);
    }
}

Can I change the serializer to interpret the escapes as UTF-8?

CodePudding user response：

Try this code:

    var msg = JsonSerializer.Deserialize<Msg>(data);

    msg.SenderName = DecodeFromUtf16ToUtf8(msg.SenderName); // Tamás

// Requires: using System.Text;
public static string DecodeFromUtf16ToUtf8(string utf16String)
{
    // Copy each char into one byte. This assumes every char is in the range 0x00-0xFF,
    // which holds when each \u00XX escape carried a single UTF-8 byte.
    byte[] utf8Bytes = new byte[utf16String.Length];
    for (int i = 0; i < utf16String.Length; i++)
        utf8Bytes[i] = (byte)utf16String[i];

    // Decode the collected bytes as UTF-8.
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}

Or you can do the decoding in the constructor that the serializer calls during deserialization:

var msg = System.Text.Json.JsonSerializer.Deserialize<Msg>(data);

public class Msg
{
    [JsonPropertyName("sender_name")]
    public string SenderName { get; set; }
    
    public Msg(string SenderName)
    {
        this.SenderName= DecodeFromUtf16ToUtf8(SenderName);
    }
}
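
One caveat with the cast-based approach: a char above 0xFF gets truncated, and a string that was already correct gets mangled. A more defensive variant might look like the sketch below. The class and method names are made up for illustration, and the "looks like byte-per-char UTF-8" check is only a heuristic: a string that legitimately contains the text "Ã¡" would still be rewritten.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Strict UTF-8 decoder: throws DecoderFallbackException on invalid byte sequences
    // instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: reinterprets each char as one UTF-8 byte and decodes the result.
    // Returns the input unchanged when it does not look like byte-per-char UTF-8.
    public static string Repair(string value)
    {
        var bytes = new byte[value.Length];
        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] > 0xFF)
                return value; // a genuine wide character, so this is not byte-per-char data
            bytes[i] = (byte)value[i];
        }

        try
        {
            return StrictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return value; // the bytes are not valid UTF-8, leave the string as it was
        }
    }
}
```

With this, MojibakeRepair.Repair("TamÃ¡s") returns "Tamás", while an already correct "Tamás" is returned unchanged because the lone byte 0xE1 is not valid UTF-8.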

CodePudding user response：

The problem here is that the sender of your JSON is using the wrong values, \u00c3 and \u00a1, as the numeric escape codes for á inside the string literal. The meaning of the \uXXXX escape sequences is specified by the JSON Proposal as well as the JSON Standard: XXXX is the character's "4HEXDIG" UTF-16 code unit value ^[1], which for á is \u00E1. Instead, the provider of your JSON file (apparently Facebook's "backup your data" feature) writes each byte of the character's UTF-8 encoding (0xC3 0xA1 for á) as its own \uXXXX escape sequence, rather than using UTF-16 as required by the standard.
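
To make the difference concrete, here is a small sketch comparing the standard escape with the byte-wise one under the default System.Text.Json behavior. The class and method names are just for illustration.

```csharp
using System;
using System.Linq;
using System.Text.Json;

public static class EscapeSemanticsDemo
{
    // Deserializes a bare JSON string literal with the default System.Text.Json behavior.
    public static string? Parse(string jsonStringLiteral) =>
        JsonSerializer.Deserialize<string>(jsonStringLiteral);

    // Lists the UTF-16 code units of a string, e.g. "U+00E1".
    public static string CodeUnits(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));
}
```

Parsing the literal "Tam\u00e1s" yields Tamás, where á is the single code unit U+00E1. Parsing "Tam\u00c3\u00a1s" yields TamÃ¡s, because the two escapes become the two separate code units U+00C3 and U+00A1.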

There is no built-in way to tell System.Text.Json (or Json.NET, for that matter) that the \uXXXX escape sequences use nonstandard values. However, Utf8JsonReader provides access to the underlying raw bytes via the ValueSpan and ValueSequence properties, so it is possible to create a custom JsonConverter<string> that does the necessary decoding and unescaping itself.
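
As a quick look at what the reader exposes, the sketch below returns the raw bytes of the first string value in a document, still escaped. The helper name is made up for illustration, and it assumes the whole input sits in one contiguous buffer, so only ValueSpan is consulted.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class RawStringTokenDemo
{
    // Returns the raw, still-escaped bytes of the first JSON string value found,
    // decoded as UTF-8 text purely for display. Property names are skipped because
    // they are reported as PropertyName tokens, not String tokens.
    public static string? FirstRawStringValue(string json)
    {
        var reader = new Utf8JsonReader(Encoding.UTF8.GetBytes(json));
        while (reader.Read())
        {
            if (reader.TokenType == JsonTokenType.String)
            {
                // Single contiguous input, so ValueSpan holds the token bytes (without quotes).
                // With a multi-segment ReadOnlySequence, ValueSequence would be used instead.
                return Encoding.UTF8.GetString(reader.ValueSpan);
            }
        }
        return null;
    }
}
```

For the sample document from the question this returns Tam\u00c3\u00a1s with the backslashes intact, which is the form a converter needs in order to reinterpret the escapes.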

First, create the following converter:

// Requires: using System; using System.Buffers; using System.Globalization; using System.Text;
//           using System.Text.Json; using System.Text.Json.Serialization;
public class StringConverterForUtf8EscapedCharValues : JsonConverter<string>
{
    public override string? Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
    {
        if (reader.TokenType != JsonTokenType.String)
            throw new JsonException();

        // Strings without escape sequences need no special handling.
        if (!reader.ValueIsEscaped)
            return reader.GetString();

        // Raw (still escaped) UTF-8 bytes of the string token, without the surrounding quotes.
        ReadOnlySpan<byte> span = reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan;

        // Normally a JSON string is a UTF-8 byte sequence with embedded UTF-16 escape codes.
        // These improperly encoded JSON strings are UTF-8 byte sequences with embedded UTF-8 escape codes.

        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        // Allocate the scratch buffers once; stackalloc inside the loop would grow the stack on every iteration.
        Span<char> chars = stackalloc char[Encoding.UTF8.GetMaxCharCount(4)];
        Span<char> hexchars = stackalloc char[4];
        Span<byte> bytes = stackalloc byte[1];

        for (int i = 0; i < span.Length; i++)
        {
            if (span[i] != '\\')
            {
                // Unescaped byte: pass it to the stateful decoder, which emits chars once a sequence is complete.
                var n = decoder.GetChars(span.Slice(i, 1), chars, false);
                sb.Append(chars.Slice(0, n));
            }
            else if (i < span.Length - 1 && span[i + 1] == '"')
            {
                sb.Append('"');
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == '\\')
            {
                sb.Append('\\');
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == '/')
            {
                sb.Append('/');
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == 'b')
            {
                sb.Append('\u0008'); // backspace
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == 'f')
            {
                sb.Append('\u000C'); // form feed
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == 'n')
            {
                sb.Append('\n');
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == 'r')
            {
                sb.Append('\r');
                i++;
            }
            else if (i < span.Length - 1 && span[i + 1] == 't')
            {
                sb.Append('\t');
                i++;
            }
            else if (i < span.Length - 5 && span[i + 1] == 'u')
            {
                // \uXXXX: parse the four hex digits as a single UTF-8 byte (values above 0xFF are rejected).
                hexchars[0] = (char)span[i + 2];
                hexchars[1] = (char)span[i + 3];
                hexchars[2] = (char)span[i + 4];
                hexchars[3] = (char)span[i + 5];
                if (!byte.TryParse(hexchars, NumberStyles.HexNumber, NumberFormatInfo.InvariantInfo, out var b))
                {
                    throw new JsonException();
                }
                bytes[0] = b;
                var n = decoder.GetChars(bytes, chars, false);
                sb.Append(chars.Slice(0, n));
                i += 5;
            }
            else
            {
                throw new JsonException();
            }
        }
        return sb.ToString();
    }

    public override void Write(Utf8JsonWriter writer, string value, JsonSerializerOptions options) => writer.WriteStringValue(value);
}

And now you will be able to do

var options = new JsonSerializerOptions
{
    Converters = { new StringConverterForUtf8EscapedCharValues() },
};
var msg = JsonSerializer.Deserialize<Msg>(data, options);
Assert.That(msg?.SenderName?.StartsWith("Tamás") == true); // Succeeds
Console.WriteLine(msg?.SenderName); // Prints Tamás

Notes:

Since a JSON file is generally a UTF-8 encoded character stream, decoding a single string literal in a well-formed JSON file can require decoding a mixture of UTF-8 and UTF-16 values.
The converter may not work if the underlying byte stream was not encoded using UTF-8.
Writing with (incorrect) UTF-8 values for escaped characters is not implemented.
The incorrect escaped values have to be fixed before the JSON string literal is decoded to a C# string, because the information about which characters came from escape sequences is lost once decoding and unescaping are complete.
I haven't tested performance. It might be more performant to use the decoder returned by Encoding.UTF8.GetDecoder() to decode in chunks, rather than byte-by-byte as is done in this prototype.
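
A rough sketch of the "decode in chunks" idea from the last note: first unescape everything into a byte buffer, then decode the whole buffer with a single call. This is untested for performance as well, the class and method names are made up here, and like the converter it assumes every \uXXXX escape carries exactly one UTF-8 byte.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.Json;

public static class Utf8EscapeDecoder
{
    // Hypothetical helper: unescapes the raw bytes of a JSON string token, treating each
    // \u00XX escape as a single UTF-8 byte, then decodes the whole buffer in one call.
    public static string Decode(ReadOnlySpan<byte> raw)
    {
        // Unescaping never grows the data, so raw.Length bytes are always enough.
        var buffer = new byte[raw.Length];
        int count = 0;

        for (int i = 0; i < raw.Length; i++)
        {
            if (raw[i] != (byte)'\\')
            {
                buffer[count++] = raw[i]; // plain byte, copy as-is
                continue;
            }
            if (i + 1 >= raw.Length)
                throw new JsonException(); // dangling backslash

            byte next = raw[++i];
            if (next == (byte)'u')
            {
                if (i + 4 >= raw.Length)
                    throw new JsonException(); // fewer than four hex digits left
                string hex = Encoding.ASCII.GetString(raw.Slice(i + 1, 4));
                // Values above 0xFF do not fit in a byte and are rejected.
                if (!byte.TryParse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out byte b))
                    throw new JsonException();
                buffer[count++] = b;
                i += 4;
            }
            else
            {
                // Simple one-character escapes defined by JSON.
                buffer[count++] = next switch
                {
                    (byte)'"' => (byte)'"',
                    (byte)'\\' => (byte)'\\',
                    (byte)'/' => (byte)'/',
                    (byte)'b' => (byte)'\b',
                    (byte)'f' => (byte)'\f',
                    (byte)'n' => (byte)'\n',
                    (byte)'r' => (byte)'\r',
                    (byte)'t' => (byte)'\t',
                    _ => throw new JsonException(),
                };
            }
        }
        return Encoding.UTF8.GetString(buffer, 0, count);
    }
}
```

Inside a converter this could be called with the token's raw bytes, for example Utf8EscapeDecoder.Decode(reader.ValueSpan) when the value is not split across segments.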


^[1] Characters not in the Basic Multilingual Plane should use two sequential escape sequences, e.g. \uD834\uDD1E
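
As a small check of the footnote, this sketch parses a standard-conforming literal and combines the surrogate pair into one code point. The helper name is only illustrative.

```csharp
using System;
using System.Text.Json;

public static class SurrogatePairDemo
{
    // Parses a bare JSON string literal and returns the code point of its first character,
    // combining a surrogate pair into one value when present.
    public static int FirstCodePoint(string jsonStringLiteral)
    {
        string s = JsonSerializer.Deserialize<string>(jsonStringLiteral)
                   ?? throw new JsonException("Expected a JSON string, got null.");
        return char.ConvertToUtf32(s, 0);
    }
}
```

For the literal "\uD834\uDD1E" this returns 0x1D11E, a single character outside the Basic Multilingual Plane that occupies two UTF-16 code units in the resulting C# string.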