I'm using Normalizer.normalize(url, Normalizer.Form.NFD) to avoid having characters like é in my URL, and I do not understand the meaning of the Normalizer.Form constants (NFC, NFD, NFKC, and NFKD) or when to use each one.
I consulted the documentation but this did not help at all.
Does anyone have any idea on this topic?
Thanks in advance!
CodePudding user response:
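- D = Decomposed: two code points, e followed by the combining accent ´
- C = Composed: one code point, é
- K = Compatibility: this covers ligatures, which can be one letter ﬃ (U+FB03) or three letters: f f i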
This is mentioned in the javadoc:
Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character A-acute. In Unicode, this can be encoded as a single character (the "composed" form):
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be treated as the same "user-level" character "A with acute accent". When you are searching or comparing text, you must ensure that these two sequences are treated as equivalent. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
Similarly, the string "ffi" can be encoded as three separate letters:
U+0066 LATIN SMALL LETTER F
U+0066 LATIN SMALL LETTER F
U+0069 LATIN SMALL LETTER I
or as the single character
U+FB03 LATIN SMALL LIGATURE FFI
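To see what the four forms do to the two examples from the javadoc, here is a small sketch. The class name NormalizerFormsDemo and the helper codePoints are made up for this illustration; only Normalizer.normalize and Normalizer.Form come from the JDK.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

final class NormalizerFormsDemo {

    /** Renders each code point as U+XXXX so that invisible differences show up. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Composed e-acute (U+00E9) followed by the ffi ligature (U+FB03).
        String sample = "\u00E9\uFB03";
        for (Form form : Form.values()) {
            System.out.println(form + ": " + codePoints(Normalizer.normalize(sample, form)));
        }
        // NFD:  U+0065 U+0301 U+FB03
        // NFC:  U+00E9 U+FB03
        // NFKD: U+0065 U+0301 U+0066 U+0066 U+0069
        // NFKC: U+00E9 U+0066 U+0066 U+0069
    }
}
```

Note that only the K forms touch the ligature, and only the D forms split the accent from its letter.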
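So in your case you want a decomposed form. NFD is enough to split accented letters into base letter plus accent; NFKD, the full decomposition, additionally breaks up compatibility characters such as ligatures.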
s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
The trailing replaceAll just removes the combining diacritical marks, the zero-width accents such as ´. There are still problematic Latin letters, like:
- ł Polish small L with stroke
- ı Turkish small I without dot
- İ Turkish capital I with dot (this one does decompose, into I plus a combining dot above, so it ends up as a plain I)
But you might already be doing a non-ASCII replace anyway.
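Putting those steps together, a minimal sketch could look like the following. The class UrlSlug, the method toAsciiSlug, the hand-picked replacement table and the choice of '-' as the fallback character are all assumptions for illustration, not something the Normalizer API prescribes.

```java
import java.text.Normalizer;

final class UrlSlug {

    static String toAsciiSlug(String input) {
        // 1. Full (compatibility) decomposition: accents and ligatures are split up.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // 2. Drop the combining marks that step 1 separated from their base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}", "");
        // 3. Letters without any decomposition need a manual mapping
        //    (assumption: this short table is only an example, extend as needed).
        String mapped = withoutMarks
                .replace('\u0142', 'l')   // Polish small l with stroke
                .replace('\u0141', 'L')   // Polish capital L with stroke
                .replace('\u0131', 'i');  // Turkish small dotless i
        // 4. Whatever is still outside ASCII is replaced by '-'.
        return mapped.replaceAll("[^\\x00-\\x7F]", "-");
    }

    public static void main(String[] args) {
        System.out.println(toAsciiSlug("caf\u00E9"));                   // cafe
        System.out.println(toAsciiSlug("\u0141\u00F3d\u017A"));         // Lodz
        System.out.println(toAsciiSlug("\u0130stanbul"));               // Istanbul
    }
}
```

Whether replacing leftovers with '-' is acceptable depends on the site; percent-encoding the original text would be the alternative if the characters must survive.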
Of course nowadays one might have Unicode URLs to some degree, sites with special characters. And with some care those characters would not get mangled.
Another use of normalisation in decomposed form is sorting country names alphabetically, so that Österreich (Austria in German) comes before P.
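A rough sketch of that sorting idea: compare the names by a key with the accents stripped. The names AccentInsensitiveSort, sortKey and sorted are hypothetical, and the sample list is made up.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AccentInsensitiveSort {

    /** Decomposes the name and drops the combining marks, e.g. Ö becomes O. */
    static String sortKey(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** Returns a sorted copy; the original strings are kept, only the keys are stripped. */
    static List<String> sorted(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.comparing(AccentInsensitiveSort::sortKey));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00D6sterreich" is Österreich; the escape avoids source-encoding problems.
        List<String> countries = List.of("Polen", "\u00D6sterreich", "Norwegen");
        System.out.println(sorted(countries));
    }
}
```

With plain String.compareTo the Ö (U+00D6) would sort after every ASCII letter, so Österreich would land behind Polen; with the stripped key it sorts between Norwegen and Polen.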
Some Details
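The K stands for "compatibility": the K forms (NFKC, NFKD) additionally replace compatibility characters such as the ﬃ ligature with their plain equivalents, so the choice matters.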
One can have more than one accent (zero-width combining diacritical mark) at a letter.
One can have a String with both composed and decomposed letters.
So NFC actually does a canonical decomposition followed by a canonical composition. A good composition has to start from a fully decomposed string, and the Normalizer does that first step for you.
Composition also has its uses: the result is guaranteed to be canonical (a single normalized form), and it is compact, typically one code point per accented letter, which is convenient with String.codePointAt.
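The points above can be checked with a short sketch. The class ComposeDemo and the helper nfc are made up for this example; Normalizer.normalize and Normalizer.isNormalized are the JDK calls.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

final class ComposeDemo {

    static String nfc(String s) {
        return Normalizer.normalize(s, Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";     // é as one code point
        String decomposed = "cafe\u0301";  // e followed by a combining acute accent

        // The two strings look the same but are not equal until normalized.
        System.out.println(composed.equals(decomposed));            // false
        System.out.println(nfc(composed).equals(nfc(decomposed)));  // true

        // NFC is the compact form: 4 chars instead of 5.
        System.out.println(decomposed.length());       // 5
        System.out.println(nfc(decomposed).length());  // 4

        // Two accents on one letter, typed in different orders (dot above, dot below).
        String aboveFirst = "a\u0307\u0323";
        String belowFirst = "a\u0323\u0307";
        System.out.println(aboveFirst.equals(belowFirst));            // false
        System.out.println(nfc(aboveFirst).equals(nfc(belowFirst)));  // true

        // isNormalized tells whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(decomposed, Form.NFC)); // false
    }
}
```

So comparing or storing text in NFC gives one representation per user-level character, regardless of how the input was typed.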