When I read all the bytes from a string using Encoding.Unicode, it gives me blank (0) values.
When I run this code:
byte[] value = Encoding.Unicode.GetBytes("Hi");
foreach (byte b in value) Console.WriteLine(b); // print each byte on its own line
It gives me this output:
72
0
105
0
I know this is because UTF-16 stores each of these characters in 2 bytes and the 0 is just the second byte, but my question is: should I delete the 0s? As far as I know, they do not do anything, and my program has to loop through the array, so the 0s would only make it slower.
CodePudding user response:
No, you must not delete bytes from a text encoding, because then you end up with garbage that can no longer be considered a valid encoding of the text.
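To make the "garbage" concrete, here is a minimal sketch of what happens if the zero bytes are dropped and the result is decoded as UTF-16 again. The class name ZeroByteDemo and the helper StripZeros are made up for this illustration; only the Encoding and LINQ calls are real .NET APIs.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ZeroByteDemo
{
    // Hypothetical helper: drops every 0 byte, as proposed in the question.
    public static byte[] StripZeros(byte[] bytes) =>
        bytes.Where(b => b != 0).ToArray();

    public static void Main()
    {
        byte[] original = Encoding.Unicode.GetBytes("Hi"); // 72, 0, 105, 0
        byte[] stripped = StripZeros(original);            // 72, 105

        // The decoder now pairs 72 and 105 into a single UTF-16 code unit.
        string decoded = Encoding.Unicode.GetString(stripped);

        Console.WriteLine(decoded == "Hi");  // False
        Console.WriteLine(decoded.Length);   // 1
        Console.WriteLine((int)decoded[0]);  // 26952 (0x6948), an unrelated character
    }
}
```

The two remaining bytes are read as one little-endian code unit, so "Hi" comes back as a single unrelated character. The zeros are also not always the second byte: for a character such as U+0100, the zero is the low byte and carries real information.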
If you have many ASCII characters and a few non-ASCII characters, you are probably better off with the UTF-8 encoding instead of UTF-16.
UTF-8 encodes to a single byte for ASCII chars and uses 2-4 bytes for non-ASCII chars.
Here's an illustrative example:
var text = "ö";
Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(text))); // 246,0
Console.WriteLine(string.Join(",", Encoding.UTF8.GetBytes(text))); // 195,182
The text is identical in both cases; only the encoding differs, and so the bytes differ too.
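If the goal is fewer bytes to loop over, the sizes can be compared directly instead of editing the byte array. The following is a rough sketch, assuming the text is mostly ASCII; the class EncodingSizeDemo, the helper ByteCounts, and the sample strings are chosen here purely for illustration.

```csharp
using System;
using System.Text;

public static class EncodingSizeDemo
{
    // Hypothetical helper: byte counts of the same text in UTF-8 and UTF-16.
    public static (int Utf8, int Utf16) ByteCounts(string text) =>
        (Encoding.UTF8.GetByteCount(text), Encoding.Unicode.GetByteCount(text));

    public static void Main()
    {
        // "Hi", o-umlaut, the euro sign (U+20AC), and an emoji (U+1F600)
        foreach (string text in new[] { "Hi", "ö", "\u20AC", "\U0001F600" })
        {
            var (utf8, utf16) = ByteCounts(text);
            Console.WriteLine($"UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes");
        }

        // Round trip: decoding with the same encoding restores the text.
        byte[] bytes = Encoding.UTF8.GetBytes("Hi"); // 72, 105
        Console.WriteLine(Encoding.UTF8.GetString(bytes) == "Hi"); // True
    }
}
```

For these samples the counts are 2 vs 4 for "Hi", 2 vs 2 for "ö", 3 vs 2 for the euro sign, and 4 vs 4 for the emoji. UTF-8 is smaller only for the ASCII text, which is why it pays off when ASCII dominates.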