How does the Go compiler know which bytes in a byte slice should be grouped together into one rune?


Example:

package main

import "fmt"

func main() {
    byteSlice := []byte{226, 140, 152, 97, 98, 99}
    fmt.Println(string(byteSlice))
}

prints out:

⌘abc

Under the hood, how did Go know that the first three bytes (226, 140, 152) should be grouped together as a single rune, ⌘, while the remaining bytes should be converted to three separate runes, a, b, and c, respectively? (In Go, rune is an alias for int32.)

CodePudding user response:

By decoding the UTF-8 encoding into UTF-32.
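A quick way to watch that grouping happen is to step through the slice with the standard library's unicode/utf8 package. This is just an illustrative sketch; the trailing comments show what it prints for the slice from the question.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    byteSlice := []byte{226, 140, 152, 97, 98, 99}

    // DecodeRune consumes one UTF-8 sequence (1 to 4 bytes) per call and
    // reports how many bytes it used, which is exactly the grouping that
    // the string conversion performs.
    for i := 0; i < len(byteSlice); {
        r, size := utf8.DecodeRune(byteSlice[i:])
        fmt.Printf("bytes %v -> %q (U+%04X)\n", byteSlice[i:i+size], r, r)
        i += size
    }
}

// bytes [226 140 152] -> '⌘' (U+2318)
// bytes [97] -> 'a' (U+0061)
// bytes [98] -> 'b' (U+0062)
// bytes [99] -> 'c' (U+0063)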

It's a simple matter of looking at the leading bits of each octet, masking out the sentinel bits, and combining the data bits with bit shifts and bitwise OR (see the sketch after the table below).

Code point ↔ UTF-8 conversion

First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
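Here is a minimal hand-rolled sketch of that table's bit layout, assuming well-formed input. The decodeOne helper is purely illustrative (it is not the real unicode/utf8 implementation and does no validation of continuation bytes or truncated sequences).

package main

import "fmt"

// decodeOne inspects the leading bits of the first byte to find the sequence
// length, masks out the sentinel bits, and ORs the data bits together.
// Illustrative only: no error handling for invalid UTF-8.
func decodeOne(b []byte) (r rune, size int) {
    switch {
    case b[0] < 0x80: // 0xxxxxxx: ASCII, one byte
        return rune(b[0]), 1
    case b[0]&0xE0 == 0xC0: // 110xxxxx 10xxxxxx
        return rune(b[0]&0x1F)<<6 | rune(b[1]&0x3F), 2
    case b[0]&0xF0 == 0xE0: // 1110xxxx 10xxxxxx 10xxxxxx
        return rune(b[0]&0x0F)<<12 | rune(b[1]&0x3F)<<6 | rune(b[2]&0x3F), 3
    default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return rune(b[0]&0x07)<<18 | rune(b[1]&0x3F)<<12 |
            rune(b[2]&0x3F)<<6 | rune(b[3]&0x3F), 4
    }
}

func main() {
    b := []byte{226, 140, 152, 97, 98, 99}
    for i := 0; i < len(b); {
        r, size := decodeOne(b[i:])
        fmt.Printf("%q ", r)
        i += size
    }
    fmt.Println() // '⌘' 'a' 'b' 'c'
}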

There's rather more to it than that, though, due to the various Unicode normalization forms and the possible presence of combining marks (e.g., e/U+0065 followed by the combining acute accent U+0301 can be normalized into the single code point é/U+00E9).
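That composition is not done by the decoder; you normalize explicitly. A sketch using the golang.org/x/text/unicode/norm package (an external module, not part of the standard library):

package main

import (
    "fmt"

    "golang.org/x/text/unicode/norm" // external module: golang.org/x/text
)

func main() {
    decomposed := "e\u0301" // 'e' followed by the combining acute accent: two code points
    composed := norm.NFC.String(decomposed)

    fmt.Printf("%s has %d code points\n", decomposed, len([]rune(decomposed))) // 2
    fmt.Printf("%s has %d code points\n", composed, len([]rune(composed)))     // 1
    fmt.Println(composed == "\u00e9")                                          // true
}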

Interestingly, if you look at the sources for unicode/utf8, it appears that DecodeRune() and DecodeRuneInString() do nothing with respect to combining marks; each call decodes exactly one code point. The underlying assumption seems to be that the octets in the string are in Unicode Normalization Form C (Canonical Decomposition followed by Canonical Composition), so you'd never see combining marks that need to be composed.
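A small sketch that makes this concrete: the decoder returns a combining mark as its own rune and never merges it with the preceding letter, so the decomposed and precomposed forms of é come out differently.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    nfd := "e\u0301" // decomposed: 'e' followed by the combining acute accent
    nfc := "\u00e9"  // precomposed: é

    // The combining mark is decoded as a separate rune.
    for s := nfd; len(s) > 0; {
        r, size := utf8.DecodeRuneInString(s)
        fmt.Printf("U+%04X ", r)
        s = s[size:]
    }
    fmt.Println() // U+0065 U+0301

    r, _ := utf8.DecodeRuneInString(nfc)
    fmt.Printf("U+%04X\n", r) // U+00E9
}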
