Is a non-English string still 'a read-only slice of bytes'?

It's mentioned in https://go.dev/blog/strings that:

In Go, a string is in effect a read-only slice of bytes.

To my understanding, the byte data type in Go is equivalent to uint8 and represents ASCII characters, which works perfectly for strings that consist of English letters only.

For non-English strings, such as Japanese, Korean, Chinese, or Arabic, is it still correct to say "In Go, a string is in effect a read-only slice of bytes."?

Or should I say "In Go, a non-English string is in effect a read-only slice of runes", since ASCII cannot represent Japanese, Korean, Chinese, or Arabic characters, which must be represented in Unicode (UTF-8) using rune?

CodePudding user response：

"To my understanding, the byte data type in Go is equivalent to uint8 and represents ASCII characters, which works perfectly for strings that consist of English letters only."

No. byte doesn't mean ASCII; it is just an alias for uint8, an 8-bit value with no character encoding attached. Go doesn't use ASCII as its string encoding.

Strings in Go are normally UTF-8. The string functions in the standard library all work with UTF-8. Accessing a string as a series of runes using range assumes that the string is UTF-8. UTF-8 is an encoding of Unicode into bytes. All of this is true regardless of what language you're working with.

Strings can also contain data that isn't UTF-8; as the article you quoted said, a string is basically just an immutable []byte, and can contain any sequence of bytes, including binary data, and character data in other encodings than UTF-8. This is perfectly valid; it just doesn't make sense to use strings functions or range on these "strings". The types really only capture the difference between mutable and immutable; they don't capture the difference between "a character string" and "a bunch of bytes".

CodePudding user response：

Yes, a string is a slice of bytes regardless of the character set. For example:

s := "селёдка"
fmt.Printf("%d\n", len(s))

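will print 14 even though the word is 7 letters long, because each of these Cyrillic letters takes 2 bytes in UTF-8. That means you cannot, for example, use s[2] to get the third character; indexing returns a single byte.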

However, when you're iterating over a string, you are getting runes:

s := "селёдка"
for _, c := range s {
    fmt.Printf("%c\n", c) // %c prints the rune as a character; %s does not accept a rune
}

will print the word letter by letter.

If you want to deal with the runes directly, convert the string to a slice of runes:

r := []rune(s)