I was running some fuzzing on my code and it found a bug. I have reduced it down to the following code snippet and I cannot see what is wrong.
Given the string
s := string("\xc0")
len(s) returns 1. However, if you loop through the string, the first rune has length 3.
for _, r := range s {
	fmt.Println("len of rune:", utf8.RuneLen(r)) // Will print 3
}
My assumptions are:
len(string) is returning the number of bytes in the string
utf8.RuneLen(r) is returning the number of bytes in the rune
I assume I am misunderstanding something, but how can the length of a string be less than the length of one of its runes?
Playground here: https://go.dev/play/p/SH3ZI2IZyrL
CodePudding user response:
The explanation is simple: your input is not a valid UTF-8 encoded string.
fmt.Println(utf8.ValidString(s))
This outputs: false.
The for range over a string ranges over its runes, but if an invalid UTF-8 sequence is encountered, the Unicode replacement character 0xFFFD is set for r. Spec: For statements:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
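This applies to your case: the single byte \xc0 is not a valid UTF-8 sequence, so you get 0xfffd for r, which takes 3 bytes when encoded as UTF-8. utf8.RuneLen(r) reports the encoded size of that replacement rune, not the number of bytes consumed from s.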
If you go with a valid string holding a rune of \xc0:
s = string([]rune{'\xc0'})
Then output is:
len of s: 2
runes in s: 1
len of rune: 2
UTF-8 bytes of s: [195 128]
Hexa UTF-8 bytes of s: c3 80
Try it on the Go Playground.