Why is the decoded byte array different than the original?

Time:12-30

When I generate a random byte sequence, decode it into a string representation, then encode the string back to a byte array, the result differs from the original sequence. See the example below:

[byte[]]$key = [byte[]]::new(32)
[System.Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($key)
$key

output: 15 173 198 89 162 161 144 104 125 86 154 204 166 238 193 40 51 58 167 0 150 118 37 203 198 161 64 229 101 25 176 201

$decoded = [System.Text.Encoding]::UTF8.GetString($key)
$encoded = [System.Text.Encoding]::UTF8.GetBytes($decoded)
$encoded

output: 15 239 191 189 239 191 189 89 239 191 189 239 191 189 239 191 189 104 125 86 239 191 189 204 166 239 191 189 239 191 189 40 51 58 239 191 189 0 239 191 189 118 37 239 191 189 198 161 64 239 191 189 101 25 239 191 189 239 191 189

The byte sequence is clearly modified by the decode/encode round trip. This process works fine if I use [System.Text.Encoding]::Unicode instead. It seems that UTF-8 can't handle certain bytes, but I was under the impression that UTF-8 could represent any character in the Unicode standard. Can someone explain why this happens? Please and thanks

CodePudding user response:

I'm no expert on encodings, but here are a few notes:

  1. From Encoding.UTF8 docs:

    This property returns a UTF8Encoding object that encodes Unicode (UTF-16-encoded) characters into a sequence of one to four bytes per character, and that decodes a UTF-8-encoded byte array to Unicode (UTF-16-encoded) characters.

  2. Not every possible single byte is a valid character in UTF-8. UTF-8 is a variable-width character encoding that uses between one and four 8-bit bytes to represent all valid Unicode code points. If you check the Wikipedia article on the encoding, you will see that only 128 code points (0-127) are encoded as a single byte; any byte from 128 to 255 standing alone is invalid, and a lossy decoder replaces it with the Unicode replacement character U+FFFD. So the following already "breaks" the encode-decode round trip:

    var s = Encoding.UTF8.GetString(new byte[] { 128 }); // 128 alone is not valid UTF-8
    var bytes1 = Encoding.UTF8.GetBytes(s); // [239, 191, 189] -- UTF-8 bytes of U+FFFD
    
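The runs of 239 191 189 in the question's output are exactly the UTF-8 encoding of U+FFFD, the replacement character a lossy decoder substitutes for each invalid byte. A minimal Python sketch of the same behavior (Python's `errors="replace"` mode mirrors what the .NET decoder does here):

```python
# Decoding a byte that is invalid on its own in UTF-8 yields U+FFFD,
# the Unicode replacement character.
s = bytes([128]).decode("utf-8", errors="replace")
print(s)  # '\ufffd'

# Re-encoding U+FFFD produces the three-byte sequence 239 191 189,
# the same run that appears throughout the question's output.
print(list(s.encode("utf-8")))  # [239, 191, 189]
```

Each out-of-range byte in the original key becomes this three-byte sequence, which is why the re-encoded array is both different and longer.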
  3. Personally I would use the Convert.ToBase64String()/Convert.FromBase64String() pair (or Convert.ToHexString()/Convert.FromHexString() on .NET 5+) to turn arbitrary bytes into a string and back losslessly.
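The idea behind the Base64 suggestion, sketched in Python (the .NET equivalents are Convert.ToBase64String()/Convert.FromBase64String()): Base64 maps every possible byte value onto plain ASCII text, so no byte is ever invalid and the round trip is lossless.

```python
import base64
import os

key = os.urandom(32)                           # arbitrary random bytes
text = base64.b64encode(key).decode("ascii")   # safe string representation
restored = base64.b64decode(text)              # decode back to bytes

assert restored == key                         # round trip is lossless
```

This works for any byte sequence, at the cost of the string being ~33% longer than the raw bytes (hex encoding doubles the length instead).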
