I am handling input in Greek, where vowels can carry accents.
I noticed something strange in words that contain accented vowels: sometimes an accented vowel is stored as two separate characters, while other times the same accented vowel is a single character. I guess a different charset encoding is responsible for this behaviour.
An example is shown below:
έ -----> is two separate characters, ε and the ́
έ -----> is a single character έ
My questions about the behaviour described above are the following:
- What is the root cause of this phenomenon?
- How can I convert all of these two-character accented vowels into single-character accented vowels (for example, convert ε followed by a separate accent mark into the single character έ)? Is there a "global" way to deal with this kind of encoding problem?
Currently, my solution is to replace every possible two-character vowel with its single-character equivalent, as follows:
text = text.replaceAll("ά", "ά")
.replaceAll("έ", "έ")
.replaceAll("ή", "ή")
.replaceAll("ί", "ί")
.replaceAll("ύ", "ύ")
.replaceAll("ό", "ό")
.replaceAll("ώ", "ώ")
.replaceAll("Ά", "Ά")
.replaceAll("Έ", "Έ")
.replaceAll("Ή", "Ή")
.replaceAll("Ί", "Ί")
.replaceAll("Ύ", "Ύ")
.replaceAll("Ό", "Ό")
.replaceAll("Ώ", "Ώ");
But there should be a better way to achieve this. I use Java for this text handling.
CodePudding user response:
The root cause: there are sometimes several different ways to represent the same glyph in Unicode. Usually we convert text to a canonical form, but there are two canonical normalization forms (decomposed, NFD, and composed, NFC). Apple prefers the first (and it was the originally preferred form in Unicode); most other operating systems prefer the second. Each font also has its own preference (but the shaping library will handle that).
You can transform your text into the canonical composed form (NFC), but not every glyph can be reduced to a single character: some combinations of accent and base character still require two code points (or more if there are multiple accents).
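To illustrate that limit, here is a small sketch that counts the code points left after NFC. The class and method names are made up for this example, and q with an acute accent is used on the assumption that Unicode defines no precomposed character for it.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class NfcCompositionLimits {
    private NfcCompositionLimits() {}

    /** Normalizes to NFC and returns how many code points remain. */
    static int codePointsAfterNfc(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // Epsilon + combining acute has a precomposed form (U+03AD).
        System.out.println(codePointsAfterNfc("\u03B5\u0301"));
        // q + combining acute has no precomposed form, so it stays as two code points.
        System.out.println(codePointsAfterNfc("q\u0301"));
    }
}
```

So after NFC you can rely on the Greek tonos vowels being single characters, but not on every base-plus-accent pair collapsing to one.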
CodePudding user response:
Due to the complexity of Unicode, there are multiple ways of encoding the same text. You can encode ε with an acute accent as the single character "GREEK SMALL LETTER EPSILON WITH TONOS" (U+03AD), or as "GREEK SMALL LETTER EPSILON" (U+03B5) followed by "COMBINING ACUTE ACCENT" (U+0301). Different people and software sometimes encode these differently.
To convert to the "more compact" encoding, you can use the java.text.Normalizer class with Normalization Form C (NFC):
import java.text.Normalizer;
// You can pass the entire string into this:
String composed = Normalizer.normalize("ε\u0301", Normalizer.Form.NFC); // produces a string with the single char U+03AD
The less compact, decomposed form is called NFD.
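As a rough sketch of how this could replace the whole chain of replaceAll calls from the question, the helper below normalizes an entire string in one call. The class and method names are invented for illustration; only java.text.Normalizer comes from the JDK.

```java
import java.text.Normalizer;

// Hypothetical utility class, not part of any library.
final class GreekTextNormalizer {
    private GreekTextNormalizer() {}

    /** Returns the text in NFC, composing base letter + combining mark where a precomposed character exists. */
    static String toNfc(String text) {
        if (text == null) {
            return null;
        }
        // Already-normalized input is returned as-is, without building a new string.
        return Normalizer.isNormalized(text, Normalizer.Form.NFC)
                ? text
                : Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "\u03B5\u0301"; // epsilon followed by a combining acute accent
        String composed = toNfc(decomposed);
        System.out.println(decomposed.length() + " -> " + composed.length());
        System.out.println(composed.equals("\u03AD"));
    }
}
```

Normalizing once at the point where text enters the program (for example, right after reading user input) should mean the rest of the code only ever sees the composed form, including the uppercase vowels.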