Swift string diacriticInsensitive not working correctly

Time:10-17

I am folding diacritics in a string. With the Swedish locale, it converts the letters åäö to aao, but the iPhone keyboard has separate keys for these letters. I don't understand why these three letters are converted. Is there an error in my code? Shouldn't letters that appear on the keyboard be left as they are?

print("åäö".folding(options: .diacriticInsensitive, locale: Locale(identifier: "sv"))) // prints "aao"

My iPhone keyboard: [image]

CodePudding user response:

'Folding' returns a string that you can compare against another string while leaving certain features out of consideration.

If you compare two strings and the comparison is diacriticInsensitive, it ignores diacritical marks such as the umlaut in "ö", so it treats "ö" and "o" as the same character.

It's not clear to me why you are mentioning your keyboard. The keyboard is not related to the content of the strings.

Here is your code, expanded with a call that compares the two strings above while ignoring diacritical marks:

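import Foundation

// Both lines print "aao": folding strips the diacritics from the first
// string and leaves the second, which has none, unchanged.
print("åäö".folding(options: .diacriticInsensitive, locale: Locale(identifier: "sv")))
print("aao".folding(options: .diacriticInsensitive, locale: Locale(identifier: "sv")))

// Compare the original strings directly, ignoring diacritical marks.
if "åäö".compare("aao", options: .diacriticInsensitive, range: nil, locale: nil) == .orderedSame {
    print("They Match (ignoring diacritics)")
} else {
    print("As different as night and day")
}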

CodePudding user response:

This is precisely what diacriticInsensitive means. UTR #30 (Character Foldings) covers this. "Diacritic removal" includes "stroke, hook, descender" and all other "diacritics", returning the "related base character." While å is considered a separate letter in Swedish for sorting purposes, it still has a "base character" of (Latin) a. (The same goes for ä and ö.) This is a complex problem in Swedish, but the results should not be surprising.

The ultimate rules are in Unicode's DiacriticFolding. These rules are not locale-specific. It's possible that Foundation applies some additional locale rules, but clearly not in this case. The relevant Unicode folding rule is:

0061 030A;  0061    # å → a  LATIN SMALL LETTER A, COMBINING RING ABOVE → LATIN SMALL LETTER A
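One way to probe whether the locale argument matters here is to fold the same string under several locales and compare the results. This is only a sketch: "sv" comes from the question, while "en", "tr", and nil are arbitrary extra choices for illustration. The outcome may depend on the Foundation version, so nothing is claimed here beyond the "aao" the question already reports for Swedish.

```swift
import Foundation

let input = "åäö"

// "sv" is from the question; "en" and "tr" are arbitrary extra locales
// picked for illustration. nil means "no locale".
let locales: [Locale?] = [
    Locale(identifier: "sv"),
    Locale(identifier: "en"),
    Locale(identifier: "tr"),
    nil
]

// Fold the same input once per locale.
let results = locales.map { input.folding(options: .diacriticInsensitive, locale: $0) }

for (locale, folded) in zip(locales, results) {
    print(locale?.identifier ?? "nil", "->", folded)
}

// True when every locale produced the same folded string.
let localeIndependent = Set(results).count == 1
print("Same result for every locale:", localeIndependent)
```

If the last line prints true, the locale had no effect on this particular input, which is what the answer suggests.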

Many cultures have subtle definitions of what is "a letter" vs "an extension of another letter" vs "a half-letter" vs "a non-letter symbol." When computing diacritics, the Turkish "İ" has a base form of "I", but "i" does not have a base form of "ı". That's bizarre, but true, because it treats "Basic Latin" as the base alphabet. ("Basic Latin" is itself a bizarre classification, with the letters j, u, and w being somewhat modern additions. But we still call it "Latin.")
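The base-character idea can be made visible by decomposing each letter into its Unicode scalars. The sketch below is an approximation for illustration, not Foundation's actual folding implementation: the helper names decomposedCodePoints and strippingCombiningMarks are hypothetical, and dropping nonspacing marks after canonical decomposition covers only part of what diacritic folding does.

```swift
import Foundation

// Lists the code points of a string after canonical decomposition (NFD).
func decomposedCodePoints(_ text: String) -> [String] {
    text.decomposedStringWithCanonicalMapping.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        return "U+" + String(repeating: "0", count: max(0, 4 - hex.count)) + hex
    }
}

// Hypothetical helper: decomposes, then drops nonspacing marks (category Mn).
// A simplification of diacritic folding, not Foundation's implementation.
func strippingCombiningMarks(_ text: String) -> String {
    let kept = text.decomposedStringWithCanonicalMapping.unicodeScalars
        .filter { $0.properties.generalCategory != .nonspacingMark }
    return String(String.UnicodeScalarView(kept))
}

print(decomposedCodePoints("\u{E5}"))   // å: ["U+0061", "U+030A"]
print(decomposedCodePoints("\u{130}"))  // İ: ["U+0049", "U+0307"]
print(decomposedCodePoints("i"))        // ["U+0069"], nothing to strip
print(strippingCombiningMarks("\u{E5}\u{E4}\u{F6}"))  // aao
```

The first line of output matches the 0061 030A sequence in the folding rule quoted above. "İ" splits into "I" plus a combining dot, while plain "i" is already a single Basic Latin scalar, so it never turns into "ı".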

Unicode tries to "thread the needle" on these complex issues, with varying success. It tends to be biased towards Romance languages (and particularly Western European countries). But it does try. And it has a focus on what users will expect. So should a search for "halla" find "Hallå"? I'm betting that most Swedes would consider that "close enough."
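The "halla" search scenario could be tried out along these lines. This is a minimal sketch: the sample text and query are made up for illustration, and combining .caseInsensitive with .diacriticInsensitive is an assumption about what such a search would want, since "Hallå" also differs from "halla" in case.

```swift
import Foundation

// Hypothetical sample data for illustration.
let text = "Hallå världen"
let query = "halla"

// Ignore both case ("H" vs "h") and diacritics ("å" vs "a") while searching.
let options: String.CompareOptions = [.caseInsensitive, .diacriticInsensitive]

// range(of:options:) returns nil when the query does not occur in the text.
let found = text.range(of: query, options: options) != nil
print(found ? "Match found" : "No match")
```

An app that should treat å and a as distinct for Swedish users would leave out .diacriticInsensitive and accept that "halla" no longer finds "Hallå".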

Keyboards are designed to be useful to the cultures they're created for, so whether a particular symbol appears on the keyboard shouldn't be assumed to be making any strong claim about how the alphabet works. The iOS Arabic keyboard includes the half-letter "ء". That isn't making a claim about how the alphabet works. It's just saying that ء is somewhat commonly typed when writing Arabic.
