Golang strings.EqualFold gives unexpected results

Time:11-04

In golang (go1.17 windows/amd64) the program below gives the following result:

rune1 = U+0130 'İ'
rune2 = U+0131 'ı'
lower(rune1) = U+0069 'i'
upper(rune2) = U+0049 'I'
strings.EqualFold(İ, ı) = false
strings.EqualFold(i, I) = true

I thought that strings.EqualFold would check strings for equality under Unicode case folding; however, the example above seems to be a counter-example. Clearly both runes can be mapped (by hand, via lower and upper) to code points that are equal under case folding.

Question: is golang correct that strings.EqualFold(İ, ı) is false? I expected it to yield true. And if golang is correct, why would that be? Is this behaviour according to some Unicode specification?

What am I missing here?


Program:

package main

import (
   "strings"
   "testing"
   "unicode"
)

func TestRune2(t *testing.T) {
   r1 := rune(0x0130)         // U+0130 'İ'
   r2 := rune(0x0131)         // U+0131 'ı'
   r1l := unicode.ToLower(r1) // U+0069 'i'
   r2u := unicode.ToUpper(r2) // U+0049 'I'

   t.Logf("\nrune1 = %#U\nrune2 = %#U\nlower(rune1) = %#U\nupper(rune2) = %#U\nstrings.EqualFold(%s, %s) = %v\nstrings.EqualFold(%s, %s) = %v",
      r1, r2, r1l, r2u, string(r1), string(r2), strings.EqualFold(string(r1), string(r2)), string(r1l), string(r2u), strings.EqualFold(string(r1l), string(r2u)))
}

CodePudding user response:

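Yes, this is "correct" behaviour. These letters do not behave normally under case folding. See: http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

U+0130 has a full case folding ("F") and a special Turkic one ("T"), and the "T" mapping for U+0049 'I' is what leads to U+0131. The file describes "T" as follows: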

T: special case for uppercase I and dotted uppercase I
   - For non-Turkic languages, this mapping is normally not used.
   - For Turkic languages (tr, az), this mapping can be used instead
     of the normal mapping for these characters.
     Note that the Turkic mappings do not maintain canonical equivalence
     without additional processing.
     See the discussions of case mapping in the Unicode Standard for more information.

I think there is no way to force strings.EqualFold to use the tr or az mapping.
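
As a rough illustration of what the "T" mappings would mean in practice, the sketch below lowercases both strings with unicode.TurkishCase (via strings.ToLowerSpecial) and compares the results. The helper name turkicEqualFold is made up for this example, and lowercasing is only an approximation of real case folding, so treat it as a sketch rather than a drop-in replacement for EqualFold.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// turkicEqualFold is a hypothetical helper, not part of the standard library.
// It approximates Turkic caseless matching by lowercasing both strings with
// unicode.TurkishCase (which maps I to ı and İ to i) and comparing the results.
func turkicEqualFold(a, b string) bool {
	return strings.ToLowerSpecial(unicode.TurkishCase, a) ==
		strings.ToLowerSpecial(unicode.TurkishCase, b)
}

func main() {
	fmt.Println(turkicEqualFold("İ", "i")) // dotted pair
	fmt.Println(turkicEqualFold("I", "ı")) // dotless pair
	fmt.Println(turkicEqualFold("İ", "ı")) // dotted vs dotless
}
```

Note that even under the Turkish tables, 'İ' pairs with 'i' and 'I' pairs with 'ı'; the dotted and dotless letters are never treated as equal to each other, so 'İ' versus 'ı' still compares as different.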

CodePudding user response:

According to the strings.EqualFold source, unicode.ToLower and unicode.ToUpper are not used.

Instead, it uses unicode.SimpleFold to see if a particular rune is "foldable" and therefore potentially comparable:

// General case. SimpleFold(x) returns the next equivalent rune > x
// or wraps around to smaller values.
r := unicode.SimpleFold(sr)
for r != sr && r < tr {
    r = unicode.SimpleFold(r)
}
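
To see what that loop is walking over, here is a small sketch that collects the whole SimpleFold cycle (the "orbit") for a rune. The helper foldOrbit is a name invented for this example; only unicode.SimpleFold comes from the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit is a hypothetical helper: it collects every rune reachable from r
// by repeatedly applying unicode.SimpleFold until the cycle returns to r.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	// Print the fold orbit of a few runes, including the two Turkish letters.
	for _, r := range []rune{'i', 'k', 0x0130, 0x0131} {
		fmt.Printf("%#U -> %q\n", r, foldOrbit(r))
	}
}
```

Two runes can only compare equal under EqualFold if they sit in the same orbit. The orbits of U+0130 and U+0131 each contain just the rune itself, which is why neither can match 'i', 'I', or the other.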

The rune 'İ' is not foldable, although its lowercase code point is:

r := rune(0x0130)        // U+0130 'İ'
lr := unicode.ToLower(r) // U+0069 'i'

fmt.Printf("foldable? %v\n", r != unicode.SimpleFold(r)) // foldable? false
fmt.Printf("foldable? %v\n", lr != unicode.SimpleFold(lr)) // foldable? true

https://play.golang.org/p/105x0I714nS
