Unicode normalization forms Explanation (Java)

Time:10-20

I'm using Normalizer.normalize(url, Normalizer.Form.NFD) to avoid having characters like é in my URL, but I do not understand the meaning of the Normalizer.Form constants (NFC, NFD, NFKC, and NFKD) or when to use each one.

I consulted the documentation, but it did not help at all.

Does anyone have any idea on this topic?

Thanks in advance!

CodePudding user response:

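  • D = Decomposed: e followed by a combining ´ (two code points)
  • C = Composed: é (one code point)

The K variants additionally handle compatibility characters such as ligatures: the single character ﬃ versus the three letters f, f, i.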

This is mentioned in the javadoc:

Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character A-acute. In Unicode, this can be encoded as a single character (the "composed" form):

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

or as two separate characters (the "decomposed" form):

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT

To a user of your program, however, both of these sequences should be treated as the same "user-level" character "A with acute accent". When you are searching or comparing text, you must ensure that these two sequences are treated as equivalent. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent. Similarly, the string "ffi" can be encoded as three separate letters:

  U+0066    LATIN SMALL LETTER F
  U+0066    LATIN SMALL LETTER F
  U+0069    LATIN SMALL LETTER I

or as the single character

  U+FB03    LATIN SMALL LIGATURE FFI
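To see what each form does to the two examples from the javadoc, here is a small sketch that prints the code points after normalization. The class and method names (NormalizationForms, codePoints, describe) are made up for illustration; only Normalizer.normalize and Normalizer.Form come from the JDK.

```java
import java.text.Normalizer;

public class NormalizationForms {

    // Render every code point as U+XXXX so invisible differences become visible.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Normalize s with the named form ("NFC", "NFD", "NFKC" or "NFKD").
    public static String describe(String s, String formName) {
        Normalizer.Form form = Normalizer.Form.valueOf(formName);
        return codePoints(Normalizer.normalize(s, form));
    }

    public static void main(String[] args) {
        String aAcute = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE
        String ffi = "\uFB03";    // LATIN SMALL LIGATURE FFI
        for (String form : new String[] {"NFC", "NFD", "NFKC", "NFKD"}) {
            System.out.println(form + "  A-acute: " + describe(aAcute, form)
                    + "  ffi ligature: " + describe(ffi, form));
        }
    }
}
```

Note that only the K forms touch the ligature; NFC and NFD leave U+FB03 as it is, because it has a compatibility decomposition but no canonical one.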

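So in your case you want a decomposed form: NFD splits off the accents, and NFKD (full decomposition) additionally breaks up compatibility characters such as ligatures.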

s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");

The trailing replaceAll just removes the combining diacritical marks, the zero-width accents such as ´. There are still problematic Latin letters that have no decomposition, like

  • ł Polish small L with stroke
  • ı Turkish small I without dot

(İ, the Turkish capital I with dot, does decompose, into I plus a combining dot above, so it ends up as a plain I.)

But you might already be doing a non-ASCII replace anyway.
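A minimal sketch of how those leftovers could be handled, assuming you want plain ASCII in the end. The class name AsciiFold, the method fold, and the three manual mappings are my own choices for illustration, not a complete table.

```java
import java.text.Normalizer;

public class AsciiFold {

    public static String fold(String s) {
        // Decompose, then drop all combining marks (Unicode category M).
        String stripped = Normalizer.normalize(s, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}", "");
        // Hand-written mappings for letters that have no decomposition.
        // This short list is an assumption; extend it for your own data.
        stripped = stripped.replace('\u0142', 'l')   // Polish small l with stroke
                           .replace('\u0141', 'L')   // Polish capital L with stroke
                           .replace('\u0131', 'i');  // Turkish dotless small i
        // Anything that is still non-ASCII is dropped.
        return stripped.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0141\u00F3d\u017A")); // the city name Lodz with its Polish diacritics
        System.out.println(fold("caf\u00E9"));
    }
}
```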

Of course, nowadays one can have Unicode URLs to some degree, for sites with special characters. With some care those characters will not get mangled.

Another use of normalisation in decomposed form is sorting country names alphabetically: once Ö is split into O plus a combining mark, Österreich (Austria in German) sorts under O, before P.
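The sorting idea can be sketched like this, comparing the NFD form of each name. CountrySort and sortDecomposed are hypothetical names, and the three country names are just sample data.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CountrySort {

    // Sort a copy of the list, comparing the decomposed (NFD) form of each name.
    public static List<String> sortDecomposed(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparing(
                (String n) -> Normalizer.normalize(n, Normalizer.Form.NFD)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Polen", "\u00D6sterreich", "Norwegen");

        List<String> naive = new ArrayList<>(names);
        naive.sort(Comparator.naturalOrder());
        System.out.println(naive);                 // O-umlaut (U+00D6) lands after P

        System.out.println(sortDecomposed(names)); // O-umlaut lands between N and P
    }
}
```

This only fixes the base-letter ordering. For real locale-aware ordering, java.text.Collator is the class designed for the job.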


Some Details

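The K stands for "compatibility", and it matters: the K forms also replace compatibility characters such as ligatures with their plain equivalents, which the canonical forms (NFC, NFD) leave untouched.

One can have more than one accent (zero-width combining diacritical mark) on a letter.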

One can have a String with both composed and decomposed letters.

So NFC actually does canonical decomposition, followed by canonical composition. To compose reliably it is best to decompose first, and the Normalizer does that for you.
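As a rough illustration of why this matters when comparing text, the sketch below normalizes both sides to NFC before calling equals. CanonicalEquals and canonicallyEqual are names invented for this example.

```java
import java.text.Normalizer;

public class CanonicalEquals {

    // Two strings are treated as the same text if their NFC forms are identical.
    public static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";    // e-acute as one code point
        String decomposed = "cafe\u0301"; // e followed by COMBINING ACUTE ACCENT

        System.out.println(composed.equals(decomposed));           // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true

        // Two accents on one letter, typed in different orders:
        // dot below (U+0323) and acute (U+0301) are put into one canonical order.
        System.out.println(canonicallyEqual("a\u0323\u0301", "a\u0301\u0323")); // true

        // The ligature is only compatibility-equivalent, so NFC keeps it distinct.
        System.out.println(canonicallyEqual("\uFB03", "ffi")); // false
    }
}
```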

Composition also has its uses; for instance, NFC gives a single canonical form for each string, and it is compact when walking the text with String.codePointAt.
