I have these two example strings:
$a = 'Anão'; $b = 'Anão';
They look the same, but the third character is different:
In string $a it is code point 227 (U+00E3, Latin small letter a with tilde), while in string $b it is code point 97 (U+0061, Latin small letter a) followed by code point 771 (U+0303, combining tilde).
How can I detect whether a string contains any combining characters, rather than the "regular" precomposed ones?
I have tried checking every character of the string with the ord() function, but it didn't work.
CodePudding user response:
Be aware that ord() only looks at the first byte of the string it is given and returns a value from 0 to 255. It works for ASCII and other single-byte encodings, but it will not help you with multibyte UTF-8 characters.
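To see what each string actually contains, you can list its code points instead of its bytes. This is a minimal sketch, assuming PHP 7.4+ with the mbstring extension; codePoints() is a hypothetical helper name, and the strings are written with \u{...} escapes so that copy-pasting cannot silently change their form.

```php
<?php
// Hypothetical helper: returns the Unicode code points of a UTF-8 string.
// Assumes PHP 7.4+ and the mbstring extension (mb_str_split, mb_ord).
function codePoints(string $s): array
{
    return array_map(
        fn (string $char): int => mb_ord($char, 'UTF-8'),
        mb_str_split($s, 1, 'UTF-8')
    );
}

$a = "An\u{00E3}o";  // precomposed: a with tilde as one code point
$b = "Ana\u{0303}o"; // decomposed: a followed by a combining tilde

echo implode(' ', codePoints($a)), PHP_EOL; // 65 110 227 111
echo implode(' ', codePoints($b)), PHP_EOL; // 65 110 97 771 111

// ord() only sees the first byte of the UTF-8 sequence:
echo ord("\u{00E3}"), PHP_EOL;              // 195
```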
How to Match Combined Diacritics
You can use preg_match with the Unicode u flag, and then match the appropriate Unicode character property. In this case, \p{M} will do the job. It stands for:
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
Applied as follows:
$a = 'Anão';
$b = 'Anão';
var_dump([
preg_match('~\p{M}~u', $a), // = 0
preg_match('~\p{M}~u', $b) // = 1
]);
This returns 0 and 1: your $b string has a combining diacritical mark. You would then check if (preg_match('~\p{M}~u', $str)) to find out whether a string has combining diacritics.
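The check can be wrapped in a small function. This is only a sketch: hasCombiningMarks() is a hypothetical name, and throwing on malformed input is one possible design choice. It handles the case where preg_match() returns false, which happens with the u flag when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper, not a PHP built-in.
function hasCombiningMarks(string $str): bool
{
    $result = preg_match('~\p{M}~u', $str);

    if ($result === false) {
        // With the u flag, preg_match() fails on malformed UTF-8 input.
        throw new InvalidArgumentException(
            'preg_match failed, PCRE error code ' . preg_last_error()
        );
    }

    return $result === 1;
}

var_dump(hasCombiningMarks("An\u{00E3}o"));  // bool(false)
var_dump(hasCombiningMarks("Ana\u{0303}o")); // bool(true)
```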
This matches all types of combining marks. If you want to target the exact category that the combining tilde belongs to, that is \p{Mn}:
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
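If you also want to know which marks were found and where, preg_match_all with PREG_OFFSET_CAPTURE can report them. A minimal sketch, assuming the mbstring extension for mb_ord; findNonSpacingMarks() is a hypothetical helper name. The last two lines show that an enclosing mark (U+20DD, category Me) is matched by \p{M} but not by \p{Mn}.

```php
<?php
// Hypothetical helper: lists every non-spacing mark with its byte offset.
function findNonSpacingMarks(string $str): array
{
    if (preg_match_all('~\p{Mn}~u', $str, $matches, PREG_OFFSET_CAPTURE) === false) {
        return []; // malformed UTF-8: nothing can be reported
    }

    $found = [];
    foreach ($matches[0] as [$mark, $byteOffset]) {
        $found[] = [
            'codepoint'  => sprintf('U+%04X', mb_ord($mark, 'UTF-8')),
            'byteOffset' => $byteOffset,
        ];
    }

    return $found;
}

print_r(findNonSpacingMarks("Ana\u{0303}o")); // one entry: U+0303 at byte offset 3
print_r(findNonSpacingMarks("An\u{00E3}o"));  // empty array

var_dump(preg_match('~\p{M}~u', "a\u{20DD}"));  // int(1)
var_dump(preg_match('~\p{Mn}~u', "a\u{20DD}")); // int(0)
```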
How to Normalize Diacritics
If your question stems from "how do I make these strings equivalent?", because $a != $b even though they look the same, which is obviously problematic: PHP has a convenient Normalizer class (part of the intl extension) for converting Unicode strings to their normalization forms. It is used as follows:
Normalizer::normalize('Anão', Normalizer::NFC); // Single Char, Default
Normalizer::normalize('Anão', Normalizer::NFD); // Combined
Here, NFC (the default), or Normalization Form C, stands for "Canonical Decomposition, followed by Canonical Composition": the character is first split into its parts and then composed as far as possible, often into a single character. NFD, or Normalization Form D, stands for "Canonical Decomposition": diacritics become separate combining characters.
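Applied to the two strings from the question, the effect can be sketched as follows. This assumes the intl extension is loaded; the strings are written with \u{...} escapes so that their form is unambiguous.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
$a = "An\u{00E3}o";  // precomposed (already NFC)
$b = "Ana\u{0303}o"; // decomposed (already NFD)

var_dump($a === $b);                                         // bool(false)
var_dump(strlen($a), strlen($b));                            // int(5), int(6)

// After normalizing both to the same form, they compare equal.
var_dump(Normalizer::normalize($b, Normalizer::NFC) === $a); // bool(true)
var_dump(Normalizer::normalize($a, Normalizer::NFD) === $b); // bool(true)

// isNormalized() tells you whether a string is already in a given form.
var_dump(Normalizer::isNormalized($a, Normalizer::NFC));     // bool(true)
var_dump(Normalizer::isNormalized($b, Normalizer::NFC));     // bool(false)
```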
If you normalized all strings that potentially contain diacritics, both in your source data and in queries made against it, I suspect your original question would not arise.
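As a rough illustration of normalizing on both sides, here is a sketch with an in-memory array standing in for the source data. normalizeKey() and the $index contents are hypothetical; a real application would apply the same step before writing to and querying its actual data store.

```php
<?php
// Hypothetical helper: one normalization step shared by writes and lookups.
// Requires the intl extension.
function normalizeKey(string $s): string
{
    $normalized = Normalizer::normalize($s, Normalizer::NFC);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $normalized;
}

$index = [];

// The data arrives in decomposed form and is normalized before storing.
$index[normalizeKey("Ana\u{0303}o")] = 'record #1';

// The query arrives in precomposed form and is normalized before lookup.
var_dump(isset($index[normalizeKey("An\u{00E3}o")])); // bool(true)

// Without normalization, the decomposed key would not be found.
var_dump(isset($index["Ana\u{0303}o"]));              // bool(false)
```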
P.S. See regular-expressions.info for a useful Unicode regex reference, and the Unicode character property (General Category) table on Wikipedia.
CodePudding user response:
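You can run a few comparisons after transliterating both strings to ASCII with iconv. Note that the result of //TRANSLIT depends on the iconv implementation and the current locale, so the output may differ on your system.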
$a = 'Anão';
$b = 'Anão';
$c = iconv('UTF-8', 'ASCII//TRANSLIT', $a);
$d = iconv('UTF-8', 'ASCII//TRANSLIT', $b);
echo ($c === $d ? 'same meaning' : 'different meaning'), PHP_EOL;
echo ($a === $b ? 'same string' : 'different string'), PHP_EOL;
echo ($a === $c ? 'a has no encoded characters' : 'a has encoded characters'), PHP_EOL;
echo ($b === $d ? 'b has no encoded characters' : 'b has encoded characters'), PHP_EOL;
Output
same meaning
different string
a has encoded characters
b has encoded characters
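If iconv's transliteration is not reliable on your system, the same comparisons can be done another way: decompose the string and remove the non-spacing marks. This is not what the answer above does. It is an alternative sketch, assuming the intl extension, and stripCombiningMarks() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decompose to NFD, then drop non-spacing marks.
// Requires the intl extension.
function stripCombiningMarks(string $s): string
{
    $decomposed = Normalizer::normalize($s, Normalizer::NFD);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return (string) preg_replace('~\p{Mn}+~u', '', $decomposed);
}

$a = "An\u{00E3}o";  // precomposed
$b = "Ana\u{0303}o"; // decomposed

$c = stripCombiningMarks($a); // "Anao"
$d = stripCombiningMarks($b); // "Anao"

echo ($c === $d ? 'same base letters' : 'different base letters'), PHP_EOL;
echo ($a === $b ? 'same string' : 'different string'), PHP_EOL;
echo ($a === $c ? 'a has no diacritics' : 'a has diacritics'), PHP_EOL;
echo ($b === $d ? 'b has no diacritics' : 'b has diacritics'), PHP_EOL;
```

This prints "same base letters", "different string", "a has diacritics" and "b has diacritics".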