Does UTF-16 encoding handle data compression by default?


I have a Unicode character (த). When I convert it to Data, I get:

UTF 8 -> Size: 3 bytes Array: [224, 174, 164]

UTF 16 -> Size: 4 bytes Array: [2980]

Seems pretty simple: UTF-8 takes 3 bytes per character and UTF-16 takes 4 bytes per character. But if I encode "தததத" using Swift on macOS,

import Foundation

let tamil = "தததத"

let utf8Data = tamil.data(using: .utf8)!
let utf16Data = tamil.data(using: .utf16)!

print("UTF 8 -> Size: \(utf8Data.count) bytes Array: \(tamil.utf8.map { $0 })")
print("UTF 16 -> Size: \(utf16Data.count) bytes Array: \(tamil.utf16.map { $0 })")

then the output is

UTF 8 -> Size: 12 bytes Array: [224, 174, 164, 224, 174, 164, 224, 174, 164, 224, 174, 164]

UTF 16 -> Size: 10 bytes Array: [2980, 2980, 2980, 2980]

I expected the UTF-16 data for "தததத" to be 4 × 4 = 16 bytes, but it is only 10 bytes, even though the array still holds 4 codes. Why is that? Where did the 6 bytes go?

CodePudding user response:

The actual byte representation of those strings is this:

UTF-8:

e0ae a4e0 aea4 e0ae a4e0 aea4

UTF-16:

feff 0ba4 0ba4 0ba4 0ba4

The UTF-8 representation is e0 ae a4 four times (3 bytes per character).
The UTF-16 representation is 0b a4 four times (2 bytes per character) plus one leading BOM, fe ff.
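
You can verify these dumps yourself; the following is a minimal sketch (the hex-formatting line is just one way to print the bytes, not part of the original answer):

import Foundation

let tamil = "தததத"

let utf8Data = tamil.data(using: .utf8)!
let utf16Data = tamil.data(using: .utf16)!

// Print every byte of each Data value as two hex digits.
print(utf8Data.map { String(format: "%02x", $0) }.joined(separator: " "))
// e0 ae a4 e0 ae a4 e0 ae a4 e0 ae a4

print(utf16Data.map { String(format: "%02x", $0) }.joined(separator: " "))
// 2-byte BOM first (fe ff or ff fe, depending on byte order), then 4 × 2 bytes of code units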

UTF-16 text should start with a BOM (byte order mark), but it is only required once at the start of the string, not once per character. So the total is 2 bytes for the BOM plus 4 × 2 bytes for the four characters, which is 10 bytes.
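
If you want the encoding without the BOM, one option (a sketch, assuming Foundation's endian-specific encodings) is to request .utf16BigEndian or .utf16LittleEndian, which fix the byte order explicitly and therefore emit no BOM:

import Foundation

let tamil = "தததத"

// Endian-specific encodings carry no BOM: 4 characters × 2 bytes each.
let bigEndian = tamil.data(using: .utf16BigEndian)!
print(bigEndian.count)   // 8

// Plain .utf16 prepends a 2-byte BOM: 2 + 8 = 10 bytes.
let withBOM = tamil.data(using: .utf16)!
print(withBOM.count)     // 10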
