I have a function that returns a byte array that represents a JSON string I need to parse. Generally, I would use Encoding.Default.GetString(myByteArray) to convert it to a string, but the resulting string has some unrecognized characters in it: ?{"rects":[],"text":""} instead of {"rects":[],"text":""}.
I have tried every other encoding the Encoding class has (that I know of, anyway): UTF8, UTF7, UTF32, Unicode, BigEndianUnicode, Latin1, and ASCII, but every single one resulted in a string with ?, ??, or ÿ_ at the beginning (or, in the case of UTF32, the whole string was ?'s).
Strangely, using new StreamReader(new MemoryStream(myByteArray)).ReadToEnd() decoded the string perfectly, and that is what I'm currently using in my code. I used StreamReader.CurrentEncoding to figure out what encoding it was using and printed it to the console (System.Text.UnicodeEncoding), then tried using new UnicodeEncoding().GetString(myByteArray), but still no luck.
How do I identify which encoding the byte arrays use, so I can decode them directly instead of wrapping them in streams?
// data is the example JSON string: {"rects":[],"text":""}
// In practice, the JSON strings are much longer.
var data = new byte[] { 255, 254, 123, 0, 34, 0, 114, 0, 101, 0, 99, 0, 116, 0, 115, 0, 34, 0, 58, 0, 91, 0, 93, 0, 44, 0, 34, 0, 116, 0, 101, 0, 120, 0, 116, 0, 34, 0, 58, 0, 34, 0, 34, 0, 125, 0 };
var ms = new MemoryStream(data);
var sr = new StreamReader(ms);
var text = sr.ReadToEnd();
Console.WriteLine(sr.CurrentEncoding);
Console.WriteLine(text);
var text2 = Encoding.Default.GetString(data);
Console.WriteLine(text2);
dynamic json = JsonConvert.DeserializeObject<dynamic>(text);
Console.WriteLine(json.text);
Console.WriteLine(json.rects);
Thanks!
CodePudding user response:
Well, you have UTF-16 with a byte order mark (BOM), which identifies the encoding. In your case the BOM is FF FE (255, 254), which means UTF-16 (LE):
var data = new byte[] {
255, 254, // <- BOM (UTF-16 (LE))
123, 0, 34, 0, 114, 0, /* Payload */ };
So you can just skip the two BOM bytes and decode the rest:
string result = Encoding.Unicode.GetString(data.AsSpan(2));
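A minimal sketch of two equivalent ways to do this, assuming the array always starts with the two-byte UTF-16 LE BOM. The class and method names (BomSkip, SkipBom, TrimBom) are made up for illustration. It also shows where the stray leading ? comes from: GetString does not strip the BOM, it decodes it as the character U+FEFF, which the console cannot display.

```csharp
using System;
using System.Text;

static class BomSkip
{
    // Hypothetical helper: skip the first two bytes (the BOM) using the
    // offset/count overload, which also exists on older frameworks
    // that lack the span-based overload.
    public static string SkipBom(byte[] data) =>
        Encoding.Unicode.GetString(data, 2, data.Length - 2);

    // Hypothetical helper: decode everything, then trim the U+FEFF
    // character that the BOM bytes decode to.
    public static string TrimBom(byte[] data) =>
        Encoding.Unicode.GetString(data).TrimStart('\uFEFF');
}

class Program
{
    static void Main()
    {
        // BOM (255, 254) followed by "{}" in UTF-16 LE.
        var data = new byte[] { 255, 254, 123, 0, 125, 0 };

        Console.WriteLine(BomSkip.SkipBom(data));
        Console.WriteLine(BomSkip.SkipBom(data) == BomSkip.TrimBom(data));
    }
}
```

Skipping the bytes avoids creating the extra character in the first place; trimming is handy when the string has already been decoded somewhere else.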
Note that file readers (like StreamReader) can detect the BOM, pick the correct decoder, and use it when reading the file.
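If the encoding is not known in advance, a similar check can be done by hand on the byte array. The following is only a sketch of the idea, not how StreamReader is actually implemented: BomDecoder and Decode are hypothetical names, and the fallback encoding for arrays without a BOM is an assumption you would have to choose yourself. It relies on Encoding.GetPreamble(), which returns the BOM bytes for a given encoding.

```csharp
using System;
using System.Text;

static class BomDecoder
{
    // Hypothetical helper: pick a decoder from the BOM, skip the BOM bytes,
    // and decode the rest. Falls back to the given encoding if no BOM matches.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        // UTF-32 must be checked before UTF-16: the UTF-32 LE BOM
        // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        Encoding[] candidates =
        {
            Encoding.UTF32,            // FF FE 00 00
            Encoding.UTF8,             // EF BB BF
            Encoding.Unicode,          // FF FE (UTF-16 LE)
            Encoding.BigEndianUnicode  // FE FF (UTF-16 BE)
        };

        foreach (Encoding encoding in candidates)
        {
            byte[] bom = encoding.GetPreamble();
            if (StartsWith(bytes, bom))
                return encoding.GetString(bytes, bom.Length, bytes.Length - bom.Length);
        }

        return fallback.GetString(bytes);
    }

    private static bool StartsWith(byte[] bytes, byte[] prefix)
    {
        if (bytes.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (bytes[i] != prefix[i]) return false;
        return true;
    }
}
```

With the array from the question, BomDecoder.Decode(data, Encoding.UTF8) matches the FF FE preamble and decodes the remainder as UTF-16 LE. One caveat of any BOM sniffing: UTF-16 LE text whose first character is U+0000 is indistinguishable from a UTF-32 LE BOM.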
CodePudding user response:
What are the first two bytes for? Look up what the byte values 255 and 254 are (https://ascii-tables.com/). Just remove them and it should work fine.