Normalize a unicode string in PHP

Time:11-05

In PHP,

mb_strtolower('İspanyolca');

returns

U+0069  i  LATIN SMALL LETTER I
U+0307  ̇   COMBINING DOT ABOVE
U+0073  s  LATIN SMALL LETTER S
U+0070  p  LATIN SMALL LETTER P
etc.

I need to get rid of the "U+0307 ̇ COMBINING DOT ABOVE".

I tried this:

$TheUrl=mb_strtolower('İspanyolca');
$TheUrl=normalizer_normalize($TheUrl,Normalizer::FORM_C);

The combining dot above persists.

Any help would be appreciated.

CodePudding user response:

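You can try a custom function that performs Unicode normalization (this needs the intl extension) and then removes every character that is not part of the basic Latin alphabet. Note that this also strips digits, spaces, and punctuation. For example:

function removeDiacritics($str) {
    // FORM_D decomposes accented letters into a base letter plus combining marks,
    // so the base letter survives the filter below (FORM_C would keep them fused).
    $normalizedStr = Normalizer::normalize($str, Normalizer::FORM_D);

    // Drop everything except ASCII letters, including the combining marks.
    $cleanStr = preg_replace('/[^a-zA-Z]/', '', $normalizedStr);
    return $cleanStr;
}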

$TheUrl = mb_strtolower('İspanyolca');
$TheUrl = removeDiacritics($TheUrl);
echo $TheUrl;
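
Because the regular expression above keeps only ASCII letters, digits and hyphens in a URL would be dropped too. A narrower variant might remove only the combining marks and leave everything else alone. The following is a minimal sketch, assuming the intl extension is loaded; stripCombiningMarks is a hypothetical name, not a built-in function:

```php
<?php
// Hypothetical helper: decompose the string, then drop combining marks only.
function stripCombiningMarks(string $str): string
{
    // NFD splits precomposed letters (e.g. "é") into base letter + combining mark.
    $decomposed = Normalizer::normalize($str, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $str; // input was not valid UTF-8; leave it untouched
    }

    // \p{Mn} matches nonspacing marks such as U+0307 COMBINING DOT ABOVE.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $str;
}

// "\u{0307}" is PHP's escape for U+0307, so the mark is visible in the source.
echo stripCombiningMarks("i\u{0307}spanyolca"), PHP_EOL; // ispanyolca
```

Unlike the letter-only filter, this keeps digits, hyphens, and non-Latin letters, which may or may not be what you want in a URL.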

CodePudding user response:

To handle this case, you can use the strtr function to replace specific character sequences in the string, as in this example:

$TheUrl = 'İspanyolca';
$TheUrl = mb_strtolower($TheUrl, 'UTF-8');
$TheUrl = strtr($TheUrl, array('i̇' => 'i', 'İ' => 'i'));

This replaces both the lowercase 'i' followed by a combining dot above (U+0069 U+0307) and the uppercase 'İ' (U+0130) with a plain lowercase 'i'.
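
Since the combining dot is nearly invisible inside a string literal, it may be safer to spell the code points out with PHP's "\u{...}" escapes (available since PHP 7.0). A small sketch of the same replacement written that way; lowerWithoutDot is a hypothetical name, and the mbstring extension is assumed:

```php
<?php
// Hypothetical helper: lowercase a string, then collapse "i + U+0307" to "i".
function lowerWithoutDot(string $str): string
{
    $lower = mb_strtolower($str, 'UTF-8');

    // "i\u{0307}" is a lowercase i followed by COMBINING DOT ABOVE.
    return str_replace("i\u{0307}", 'i', $lower);
}

// "\u{0130}" is İ (LATIN CAPITAL LETTER I WITH DOT ABOVE).
echo lowerWithoutDot("\u{0130}spanyolca"), PHP_EOL; // ispanyolca
```

This only targets the one sequence from the question; other combining marks in the input are left as they are.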
