Example:
package main

import "fmt"

func main() {
	byteSlice := []byte{226, 140, 152, 97, 98, 99}
	fmt.Println(string(byteSlice))
}
prints out:
⌘abc
Under the hood, how did Go know that the first three bytes (226, 140, 152) should be grouped together as a single rune (an int32), ⌘, while the remaining bytes should be converted to three separate runes: a, b, and c, respectively?
CodePudding user response:
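By decoding the UTF-8 encoding into UTF-32. Strictly speaking, string(byteSlice) only copies the bytes; the decoding into runes happens when you range over the string or convert it to []rune.
It's a simple matter of looking at the leading bits of each octet to find the sequence length, masking out the sentinel bits, and combining the data bits with bit shifts and bitwise OR.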
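As a worked example of that masking and shifting, here is a minimal sketch for the three bytes from the question. The helper name decodeThreeByte is made up for illustration and does no validation; it only handles the three-byte form.

```go
package main

import "fmt"

// decodeThreeByte (hypothetical helper) combines a three-byte UTF-8 sequence
// of the form 1110xxxx 10xxxxxx 10xxxxxx into one code point.
// It assumes the input is well formed and does not validate it.
func decodeThreeByte(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | // low 4 bits of the lead byte
		rune(b1&0x3F)<<6 | // low 6 bits of the first continuation byte
		rune(b2&0x3F) // low 6 bits of the second continuation byte
}

func main() {
	r := decodeThreeByte(226, 140, 152)
	fmt.Printf("U+%04X %c\n", r, r) // prints: U+2318 ⌘
}
```

Here 226 is 11100010, so the lead byte contributes 0010; 140 and 152 contribute 001100 and 011000, giving 0010 001100 011000, which is 0x2318.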
Code point ↔ UTF-8 conversion
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | | | |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
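The table can be turned into a small decoder more or less row by row. The following is a sketch only, not how unicode/utf8 is actually written: decodeFirst and decodeAll are hypothetical names, and this version deliberately skips checks a real decoder performs (overlong encodings, surrogates, values above U+10FFFF).

```go
package main

import "fmt"

// decodeFirst (hypothetical helper) decodes the first code point in b by
// classifying the lead byte according to the table, then folding in the low
// six bits of each continuation byte. It returns the code point and the
// number of bytes consumed, or (0xFFFD, 1) for input it cannot decode.
func decodeFirst(b []byte) (rune, int) {
	if len(b) == 0 {
		return 0xFFFD, 0
	}
	var r rune
	var n int
	switch lead := b[0]; {
	case lead&0x80 == 0x00: // 0xxxxxxx
		return rune(lead), 1
	case lead&0xE0 == 0xC0: // 110xxxxx
		r, n = rune(lead&0x1F), 2
	case lead&0xF0 == 0xE0: // 1110xxxx
		r, n = rune(lead&0x0F), 3
	case lead&0xF8 == 0xF0: // 11110xxx
		r, n = rune(lead&0x07), 4
	default: // stray continuation byte or invalid lead byte
		return 0xFFFD, 1
	}
	if len(b) < n {
		return 0xFFFD, 1 // sequence is cut short
	}
	for _, c := range b[1:n] {
		if c&0xC0 != 0x80 { // continuation bytes must look like 10xxxxxx
			return 0xFFFD, 1
		}
		r = r<<6 | rune(c&0x3F)
	}
	return r, n
}

// decodeAll (hypothetical helper) decodes every code point in b.
func decodeAll(b []byte) []rune {
	var out []rune
	for len(b) > 0 {
		r, n := decodeFirst(b)
		out = append(out, r)
		b = b[n:]
	}
	return out
}

func main() {
	// The bytes from the question decode to four code points.
	fmt.Printf("%q\n", decodeAll([]byte{226, 140, 152, 97, 98, 99}))
}
```

For the bytes in the question, the first call consumes three bytes and yields U+2318, and the next three calls each consume one byte.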
There's rather more to it than that at the text level, though, due to the various Unicode normalization forms and the possible presence of combining marks. For example, e (U+0065) followed by the combining acute accent ´ (U+0301) is canonically equivalent to the single precomposed code point (rune) é (U+00E9), but it is still encoded as two code points.
Interestingly, if you look at the sources for unicode/utf8, DecodeRune() and DecodeRuneInString() do nothing with respect to combining marks:
https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/unicode/utf8/utf8.go;l=151
https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/unicode/utf8/utf8.go;l=199
Each call decodes exactly one code point and performs no normalization, so a combining mark comes back as a rune of its own. If you need the text in Unicode Normalization Form C (canonical decomposition followed by canonical composition), you have to normalize it yourself, for example with the golang.org/x/text/unicode/norm package. Even in NFC, combining marks that have no precomposed form remain separate runes.
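A quick way to check what the decoder does with a combining sequence, using only the standard library. This is a small sketch; runesOf is a hypothetical helper that just loops over utf8.DecodeRuneInString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf (hypothetical helper) decodes s one code point at a time
// with utf8.DecodeRuneInString.
func runesOf(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:]
	}
	return out
}

func main() {
	composed := "\u00e9"    // é as one precomposed code point
	decomposed := "e\u0301" // e followed by COMBINING ACUTE ACCENT

	fmt.Printf("%U\n", runesOf(composed))   // one rune
	fmt.Printf("%U\n", runesOf(decomposed)) // two runes
	fmt.Println(composed == decomposed)     // byte-wise comparison: false
}
```

The two strings render the same on most terminals, but they decode to different rune sequences and compare unequal unless they are normalized first.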