How to convert weird Unicode into proper Unicode?

Time:09-27

(There is some kind of bug with the site. Not sure what is going on. The other question appears to be gone, yet exist in the system. That's why I repost it.)

  1. Run this: var_dump(urldecode('Test: åäö'));
  2. The result will appear to look like (or close to): Test: åäö.
  3. Now place the mouse cursor between "å" and "ä" and press the left arrow key once.
  4. The cursor will not be on the left side of "å", as expected, because it contains "invisible characters". This is the issue.

How do I turn that "weird" form of Unicode with "partial building blocks" into the literal string Test: åäö, where each of "å", "ä" and "ö" are just one character?

I do not mean converting it to some other charset. Apparently, the first string I have (which comes from the outside) is valid Unicode, but in a very problematic form. I need the nice, logical, one-char-per-actual-char Unicode which doesn't cause mayhem on my system.

CodePudding user response:

Unicode provides several ways to encode character combinations like å. Common combinations have a dedicated precomposed character (in this example, U+00E5 'LATIN SMALL LETTER A WITH RING ABOVE'), or you can store a sequence of a base character followed by one or more combining characters (in this example, a good old U+0061 'LATIN SMALL LETTER A' followed by U+030A 'COMBINING RING ABOVE').

Perhaps it's easier to understand if we print the characters using HTML entities:

&#x00E5;
&#x0061;&#x030A;

Both encodings will render the same exact character and, for most usages, you don't need to care about which one is being used.
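
To make the difference visible, a minimal sketch like the following lists the code points in each form. It assumes PHP 7.4 or later with the mbstring extension loaded (for mb_ord()); codePoints() is a hypothetical helper introduced here for illustration, not part of PHP.

```php
<?php
// Hypothetical helper: list the code points of a UTF-8 string as "U+XXXX".
// Assumes the mbstring extension is available for mb_ord().
function codePoints(string $s): array
{
    // The /u flag makes PCRE treat the subject as UTF-8, so "." matches
    // one code point; /s lets "." match newlines too.
    preg_match_all('/./su', $s, $matches);

    return array_map(
        fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        $matches[0]
    );
}

$precomposed = "\u{00E5}";          // one dedicated code point
$decomposed  = "\u{0061}\u{030A}";  // base letter + combining mark

echo implode(' ', codePoints($precomposed)), PHP_EOL; // U+00E5
echo implode(' ', codePoints($decomposed)), PHP_EOL;  // U+0061 U+030A
```

The second string is what makes the cursor stop "inside" the letter: an editor that moves by code point rather than by grapheme cluster treats the combining mark as a separate step.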

If you somehow need to care (perhaps you want to allow passwords with arbitrary Unicode characters and you need to ensure you get the same hash no matter how the character was typed), you can normalize the input: transform it so that every canonically equivalent sequence ends up as the same code points. PHP has the Normalizer class (part of the intl extension) to aid with that:

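$a = "\u{0061}\u{030A}"; // decomposed: "a" + COMBINING RING ABOVE
$b = "\u{00e5}";         // precomposed: LATIN SMALL LETTER A WITH RING ABOVE
// Normalizer::normalize() defaults to Normalizer::FORM_C (NFC, the composed form).
$normalized_a = Normalizer::normalize($a);
$normalized_b = Normalizer::normalize($b);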

Users will see the same display:

var_dump(
    $a,
    $b,
    $normalized_a,
    $normalized_b
);

string(3) "å"
string(2) "å"
string(2) "å"
string(2) "å"

But only normalized strings are guaranteed to be identical:

var_dump(
    bin2hex($a),
    bin2hex($b),
    bin2hex($normalized_a),
    bin2hex($normalized_b),
);

string(6) "61cc8a"
string(4) "c3a5"
string(4) "c3a5"
string(4) "c3a5"
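
Applied to the situation in the question, a minimal sketch could look like this. It assumes the intl extension is available and that the incoming value was percent-encoded in decomposed form (the exact input string below is an assumption, not taken from the question); toNfc() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: bring externally supplied text into NFC, the
// composed form where "å", "ä" and "ö" are one code point each.
// Requires the intl extension for the Normalizer class.
function toNfc(string $input): string
{
    // normalize() returns false on failure, e.g. for invalid UTF-8.
    $normalized = Normalizer::normalize($input, Normalizer::FORM_C);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $normalized;
}

// Assumed input: each letter is a base character followed by a
// percent-encoded combining mark (U+030A or U+0308).
$raw   = urldecode('Test: a%CC%8Aa%CC%88o%CC%88');
$clean = toNfc($raw);

// Count code points before and after; /u makes "." match one code point.
var_dump(preg_match_all('/./su', $raw));   // int(12)
var_dump(preg_match_all('/./su', $clean)); // int(9)
var_dump(bin2hex($clean)); // string(24) "546573743a20c3a5c3a4c3b6"
```

Normalizing once at the boundary where outside data enters the system keeps the rest of the code working with a single representation.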