How does the Go compiler know which bytes in a byte slice should be grouped together into one rune?

Example:

package main

import "fmt"

func main() {
    byteSlice := []byte{226, 140, 152, 97, 98, 99}
    fmt.Println(string(byteSlice))
}

prints out:

⌘abc

Under the hood, how did Go know that the first three bytes - 226, 140, 152 - should be grouped together as a single int32 rune: ⌘, while the remaining bytes should be converted to three separate runes: a, b, and c, respectively?

CodePudding user response:

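By decoding the UTF-8 byte sequence into Unicode code points (runes, which Go stores as int32 values, effectively UTF-32). Note that string(byteSlice) itself only copies the bytes; the grouping happens wherever the UTF-8 is decoded, for example in a range loop over the string, in a []rune conversion, or in the terminal that renders the output.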
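To see the grouping happen, here is a minimal sketch that walks the byte slice from the question with utf8.DecodeRune from the standard library. The helper name splitRunes is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitRunes (a hypothetical helper) walks b with utf8.DecodeRune and
// returns each decoded rune together with the number of bytes it occupied.
func splitRunes(b []byte) (runes []rune, sizes []int) {
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b) // invalid input yields (utf8.RuneError, 1)
		runes = append(runes, r)
		sizes = append(sizes, size)
		b = b[size:]
	}
	return runes, sizes
}

func main() {
	byteSlice := []byte{226, 140, 152, 97, 98, 99}
	runes, sizes := splitRunes(byteSlice)
	for i, r := range runes {
		fmt.Printf("%q U+%04X (%d byte(s))\n", r, r, sizes[i])
	}
}
```

The first rune consumes three bytes and the remaining three consume one byte each, which matches the ⌘abc output in the question.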

It is a simple matter of looking at the leading bits of each octet to find where a sequence starts and how long it is, masking out the sentinel bits, and combining the remaining data bits with bit shifts and bitwise OR.

Code point ↔ UTF-8 conversion

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
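The table can be turned into code fairly directly. The following is only a sketch of the mask-and-shift idea, not the code in unicode/utf8; decodeOne is a hypothetical helper, and it skips the validation a real decoder needs (continuation bytes, overlong forms, surrogates).

```go
package main

import "fmt"

// decodeOne is a hypothetical helper, not the standard library's code.
// It decodes the first code point in b using only the bit patterns from
// the table and returns the code point plus the number of bytes consumed.
// It does not validate continuation bytes, overlong forms, or surrogates.
// A lead byte it cannot match, or a truncated sequence, yields (0xFFFD, 1).
func decodeOne(b []byte) (rune, int) {
	if len(b) == 0 {
		return 0xFFFD, 0
	}
	b0 := b[0]
	switch {
	case b0&0x80 == 0x00: // 0xxxxxxx
		return rune(b0), 1
	case b0&0xE0 == 0xC0 && len(b) >= 2: // 110xxxxx 10xxxxxx
		return rune(b0&0x1F)<<6 | rune(b[1]&0x3F), 2
	case b0&0xF0 == 0xE0 && len(b) >= 3: // 1110xxxx 10xxxxxx 10xxxxxx
		return rune(b0&0x0F)<<12 | rune(b[1]&0x3F)<<6 | rune(b[2]&0x3F), 3
	case b0&0xF8 == 0xF0 && len(b) >= 4: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return rune(b0&0x07)<<18 | rune(b[1]&0x3F)<<12 |
			rune(b[2]&0x3F)<<6 | rune(b[3]&0x3F), 4
	}
	return 0xFFFD, 1
}

func main() {
	b := []byte{226, 140, 152, 97, 98, 99}
	for len(b) > 0 {
		r, n := decodeOne(b)
		fmt.Printf("U+%04X from %d byte(s)\n", r, n)
		b = b[n:]
	}
}
```

For the bytes in the question, 226 is 11100010 in binary, so it matches the three-byte row: the data bits 0010, 001100, and 011000 combine to 0x2318, which is ⌘.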

There's rather more to Unicode text than that, though, because of the various Unicode normalization forms and the possible presence of combining marks. For example, e (U+0065) followed by the combining acute accent (U+0301) is canonically equivalent to the single code point é (U+00E9), but it is still two code points (two runes) until something normalizes it.

If you look at the sources for unicode/utf8, DecodeRune() and DecodeRuneInString() do nothing with respect to combining marks. They do not assume any normalization form; they simply decode one code point at a time, so a combining mark comes back as a rune of its own. Even text in Unicode Normalization Form C (Canonical Decomposition followed by Canonical Composition) can contain combining marks wherever no precomposed character exists. Composing or decomposing is a separate step handled outside unicode/utf8, for example by the golang.org/x/text/unicode/norm package.
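A quick way to check how the decoder treats combining marks is to decode both spellings of é and compare. This is a small sketch using only the standard library; runesOf is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf (a hypothetical helper) decodes s one code point at a time
// with utf8.DecodeRuneInString and collects the results.
func runesOf(s string) []rune {
	var rs []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		rs = append(rs, r)
		s = s[size:]
	}
	return rs
}

func main() {
	decomposed := "e\u0301" // e followed by COMBINING ACUTE ACCENT
	precomposed := "\u00e9" // é as a single code point

	for _, s := range []string{decomposed, precomposed} {
		// %+q escapes non-ASCII so the two spellings are easy to tell apart.
		fmt.Printf("%+q: %d bytes, runes %U\n", s, len(s), runesOf(s))
	}
	// The strings are compared byte by byte, so they are not equal.
	fmt.Println("equal:", decomposed == precomposed)
}
```

The decomposed form decodes to two runes and the precomposed form to one, so the decoder returns whatever code points are in the bytes and does no composition.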