I have the Unicode character த. When I convert it to data, I get:
UTF 8 -> Size: 3 bytes Array: [224, 174, 164]
UTF 16 -> Size: 4 bytes Array: [2980]
That seems simple enough: UTF-8 takes 3 bytes per character and UTF-16 takes 4 bytes per character. But if I encode "தததத" using Swift on macOS,
let tamil = "தததத"
// Encode the string both ways; force-unwrap is fine for a demo.
let utf8Data = tamil.data(using: .utf8)!
let utf16Data = tamil.data(using: .utf16)!
// Print the size of each Data value alongside its code units.
print("UTF 8 -> Size: \(utf8Data.count) bytes Array: \(Array(tamil.utf8))")
print("UTF 16 -> Size: \(utf16Data.count) bytes Array: \(Array(tamil.utf16))")
Then the output is:
UTF 8 -> Size: 12 bytes Array: [224, 174, 164, 224, 174, 164, 224, 174, 164, 224, 174, 164]
UTF 16 -> Size: 10 bytes Array: [2980, 2980, 2980, 2980]
The UTF-16 data for "தததத" should then be 4 × 4 = 16 bytes, but it is only 10 bytes, and the array still has 4 code units. Why is that? Where did the other 6 bytes go?
Answer:
The actual byte representation of those strings is this:
UTF-8:
e0ae a4e0 aea4 e0ae a4e0 aea4
UTF-16:
feff 0ba4 0ba4 0ba4 0ba4
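
If you want to verify those dumps yourself, a small sketch like this prints each Data value byte by byte (the hexDump helper is just for illustration; it reuses the tamil string from the question):

import Foundation

let tamil = "தததத"

// Render a Data value as space-separated hex bytes.
func hexDump(_ data: Data) -> String {
    data.map { String(format: "%02x", $0) }.joined(separator: " ")
}

print("UTF-8 :", hexDump(tamil.data(using: .utf8)!))
// e0 ae a4 e0 ae a4 e0 ae a4 e0 ae a4

print("UTF-16:", hexDump(tamil.data(using: .utf16)!))
// The BOM comes first, then the 4 code units; whether it prints as
// fe ff or ff fe depends on the platform's native byte order.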
The UTF-8 representation is e0aea4 times four. The UTF-16 representation is 0ba4 times four, plus one leading BOM, feff.
UTF-16 text conventionally starts with a byte order mark (BOM), but the BOM is only needed once at the start of the string, not once per character. Each த occupies a single 2-byte UTF-16 code unit, so the total is 4 × 2 bytes for the characters plus 2 bytes for the BOM = 10 bytes. The "missing" 6 bytes were never there: the 4 bytes you measured for a single த were 2 bytes of BOM plus 2 bytes for the character, not 4 bytes per character.
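
As a side note, if you want BOM-free output, Foundation also offers the explicit-byte-order encodings .utf16BigEndian and .utf16LittleEndian. A small sketch:

import Foundation

let tamil = "தததத"
// With the byte order fixed by the encoding itself, a BOM would be
// redundant, so these variants write none.
let bigEndian = tamil.data(using: .utf16BigEndian)!
let littleEndian = tamil.data(using: .utf16LittleEndian)!
print(bigEndian.count)    // 8 — exactly 2 bytes per த
print(littleEndian.count) // 8

That is why these encodings produce exactly 8 bytes for four characters, matching the 2-bytes-per-character arithmetic above.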