Should I delete blank values in utf-16 encoding?

Time:09-30

When I read all the bytes from a string using Encoding.Unicode, I get blank (0) values.

When I run this code:

byte[] value = Encoding.Unicode.GetBytes("Hi");

It gives me the output

72
0
105
0

I know this is because UTF-16 stores each of these characters in 2 bytes and the 0 is just the second byte, but my question is: should I delete the 0s? As far as I know, they do not do anything, and my program has to loop through the array, so the 0s would only make it slower.

CodePudding user response:

No, you must not delete bytes from a text encoding, because then you end up with garbage that can no longer be considered a valid encoding of the text.
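To make the problem concrete, here is a minimal sketch of what happens if the zeros are removed and the result is decoded as UTF-16 again. StripZeros and ZeroStripDemo are hypothetical names introduced only for this illustration; the rest uses the standard Encoding.Unicode API.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ZeroStripDemo
{
    // Hypothetical helper: removes every 0 byte, as the question proposes.
    public static byte[] StripZeros(byte[] bytes) =>
        bytes.Where(b => b != 0).ToArray();

    public static void Main()
    {
        byte[] original = Encoding.Unicode.GetBytes("Hi"); // 72, 0, 105, 0
        byte[] stripped = StripZeros(original);            // 72, 105

        // Decoded as UTF-16 (little-endian), the two remaining bytes are read
        // as ONE 16-bit code unit, 0x6948, instead of the two characters 'H' and 'i'.
        string decoded = Encoding.Unicode.GetString(stripped);

        Console.WriteLine(decoded == "Hi");   // False
        Console.WriteLine(decoded.Length);    // 1
        Console.WriteLine((int)decoded[0]);   // 26952 (0x6948)
    }
}
```

The stripped array only happens to look like ASCII here; any reader that still treats it as UTF-16 gets a different string, and text containing characters above U+00FF would lose information either way.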

If you have many ASCII characters and a few non-ASCII characters, you are probably better off with the UTF-8 encoding instead of UTF-16.

UTF-8 encodes each ASCII character as a single byte and uses 2-4 bytes for each non-ASCII character.

Here's an illustrative example:

using System;
using System.Text;

var text = "ö";
Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(text))); // UTF-16: 246,0
Console.WriteLine(string.Join(",", Encoding.UTF8.GetBytes(text))); // UTF-8: 195,182

The text is identical in both cases; only the encoding, and therefore the bytes, differs.
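If the goal is simply fewer bytes to loop over, the sizes of the two encodings can be compared directly. The following is a rough sketch; ByteCountDemo and ByteCounts are hypothetical names, and the sample strings are arbitrary (written with \u escapes so the result does not depend on the source file's own encoding).

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    // Hypothetical helper: returns the encoded size of the same text
    // in UTF-8 and in UTF-16 (Encoding.Unicode).
    public static (int Utf8, int Utf16) ByteCounts(string text) =>
        (Encoding.UTF8.GetByteCount(text), Encoding.Unicode.GetByteCount(text));

    public static void Main()
    {
        // "Hi", "ö" (U+00F6), "€" (U+20AC), and a mostly-ASCII mix.
        string[] samples = { "Hi", "\u00F6", "\u20AC", "Hi \u00F6" };

        foreach (string sample in samples)
        {
            var (utf8, utf16) = ByteCounts(sample);
            Console.WriteLine($"{sample}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes");
        }
    }
}
```

For "Hi" this gives 2 bytes in UTF-8 against 4 in UTF-16, while "€" takes 3 bytes in UTF-8 against 2 in UTF-16, so UTF-8 is smaller only when the text is mostly ASCII.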
