Identify Encoding and Decode Byte Array to String

Time:12-02

I have a function that returns a byte array that represents a JSON string I need to parse. Generally, I would use Encoding.Default.GetString(myByteArray) to convert it to a string, but the resulting string has some unrecognized characters in it: ?{"rects":[],"text":""} instead of {"rects":[],"text":""}.

I have tried using every other encoding scheme the Encoding class has (that I know of anyway): UTF8, UTF7, UTF32, Unicode, BigEndianUnicode, Latin1, and ASCII, but every single one resulted in a string with ?, ??, or ÿ_ at the beginning (or in the case of UTF32, the whole string was ?'s).

Strangely, using new StreamReader(new MemoryStream(myByteArray)).ReadToEnd() decoded the string perfectly, and is what I'm currently using in my code. I used StreamReader.CurrentEncoding to figure out what encoding it was using and printed it to the console (System.Text.UnicodeEncoding), then tried using new UnicodeEncoding().GetString(myByteArray), but still no luck.

How do I identify what encoding the byte arrays are using so I can decode it directly instead of wrapping it in streams?

// data is the example JSON string: {"rects":[],"text":""}
// In practice, the JSON strings are much longer.
var data = new byte[] { 255, 254, 123, 0, 34, 0, 114, 0, 101, 0, 99, 0, 116, 0, 115, 0, 34, 0, 58, 0, 91, 0, 93, 0, 44, 0, 34, 0, 116, 0, 101, 0, 120, 0, 116, 0, 34, 0, 58, 0, 34, 0, 34, 0, 125, 0 };

var ms = new MemoryStream(data);
var sr = new StreamReader(ms);

var text = sr.ReadToEnd();
Console.WriteLine(sr.CurrentEncoding);
Console.WriteLine(text);

var text2 = Encoding.Default.GetString(data);

Console.WriteLine(text2);

dynamic json = JsonConvert.DeserializeObject<dynamic>(text);

Console.WriteLine(json.text);
Console.WriteLine(json.rects);

Thanks!

CodePudding user response:

Well, you have UTF-16 with a byte order mark (BOM), which identifies the encoding. In your case the BOM is FF FE (bytes 255, 254), which means UTF-16 little-endian (LE):

var data = new byte[] { 
  255, 254, // <- BOM (UTF-16 (LE))
  123, 0, 34, 0, 114, 0, /* Payload */ };

So you can skip the two BOM bytes and decode the rest as UTF-16 LE (Encoding.Unicode):

string result = Encoding.Unicode.GetString(data.AsSpan(2));

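Note that stream readers such as StreamReader detect the BOM by default, pick the matching decoder, and leave the BOM out of the result. Encoding.GetString does not strip it: with Encoding.Unicode the two BOM bytes decode to the character U+FEFF, which typically shows up as the stray ? at the start of the string.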
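If you want to identify the encoding yourself without wrapping the array in a stream, a minimal sketch could check the leading bytes against the well-known BOMs. The class name BomDecoder and its methods are hypothetical, and the UTF-8 fallback for arrays without a BOM is an assumption, not something the data guarantees:

```csharp
using System.Text;

// Hypothetical helper: picks a decoder from a leading BOM and decodes the rest.
public static class BomDecoder
{
    public static string Decode(byte[] data)
    {
        // The 4-byte UTF-32 BOMs are checked first, because the UTF-32 LE BOM
        // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00))
            return Encoding.UTF32.GetString(data, 4, data.Length - 4);
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF))
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true)
                .GetString(data, 4, data.Length - 4);
        if (StartsWith(data, 0xEF, 0xBB, 0xBF))
            return Encoding.UTF8.GetString(data, 3, data.Length - 3);
        if (StartsWith(data, 0xFF, 0xFE))
            return Encoding.Unicode.GetString(data, 2, data.Length - 2);
        if (StartsWith(data, 0xFE, 0xFF))
            return Encoding.BigEndianUnicode.GetString(data, 2, data.Length - 2);

        // Assumption: no BOM means UTF-8. Adjust if your source says otherwise.
        return Encoding.UTF8.GetString(data);
    }

    // True when data begins with exactly the given bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

With the array from the question, BomDecoder.Decode(data) should return {"rects":[],"text":""} with no leading character. The GetString(byte[], int, int) overload is used here instead of AsSpan so the sketch also compiles on older frameworks. One caveat: UTF-16 LE text whose first character after the BOM is U+0000 would be taken for UTF-32 LE.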

CodePudding user response:

What are the first two bytes for? Look up what characters 255 and 254 are (https://ascii-tables.com/). Just remove them and it should work fine.
