I am having a hard time finding a way to get the Unicode category (class) of a character.
List of Unicode classes: https://www.php.net/manual/en/regexp.reference.unicode.php
The function I want, in Python: https://docs.python.org/3/library/unicodedata.html#unicodedata.category
I just want the PHP equivalent of this Python function.
For example, if I called a function x like this: x('-'), it would return Pd,
because Pd is the class the hyphen belongs to.
Thanks.
CodePudding user response:
Apparently there is no built-in function that does this, so I wrote my own:
<?php
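// The 30 two-letter Unicode general categories supported by PCRE.
// Categories are mutually exclusive, so the order does not matter.
$UNICODE_CATEGORIES = [
    "Cc", "Cf", "Cs", "Co", "Cn",
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp"
];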
function uni_category($char, $UNICODE_CATEGORIES) {
    foreach ($UNICODE_CATEGORIES as $category) {
        // The u modifier makes PCRE read $char as UTF-8 instead of as single bytes.
        if (preg_match('/\p{'.$category.'}/u', $char))
            return $category;
    }
    return null; // no category in the list matched
}
// call the function
print uni_category('-', $UNICODE_CATEGORIES); // prints Pd
This code works for me; I hope it helps somebody in the future :).
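For anyone who wants the regex approach as a drop-in function, here is a minimal sketch. It assumes the input should be exactly one UTF-8 character, and the name uni_category_regex is only an illustration, not something PHP provides. It keeps the category list inside the function and anchors the pattern with \A and \z, so empty strings and multi-character strings give null instead of the category of their first match.

```php
<?php
// Hypothetical helper: returns the two-letter general category of a single
// UTF-8 character, or null if the input is not exactly one valid character.
function uni_category_regex(string $char): ?string
{
    // The 30 two-letter general categories that PCRE accepts in \p{..}.
    static $categories = [
        'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl',
        'No', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc',
        'Sk', 'So', 'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Cn',
    ];

    foreach ($categories as $category) {
        // \A and \z anchor to the whole subject; u enables UTF-8 mode.
        if (preg_match('/\A\p{' . $category . '}\z/u', $char) === 1) {
            return $category;
        }
    }

    return null; // empty string, several characters, or invalid UTF-8
}

echo uni_category_regex('-'), "\n";        // Pd
echo uni_category_regex("\u{20AC}"), "\n"; // Sc (euro sign)
```

The loop runs up to 30 small regex matches per call, which is fine for occasional lookups but probably not the best choice for processing large texts.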
CodePudding user response:
One possible way is to use IntlChar::charType (from the intl extension). Unfortunately, this method returns only an int, but that int is one of the constants defined in the IntlChar class. The constants for the 30 categories cover the range 0 to 29 with no gaps, so all you have to do is build an indexed array that follows the same order:
$shortCats = [
'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];
echo $shortCats[IntlChar::charType('-')]; //Pd
Note: if you are worried that the numeric values defined in the class might change in the future and want to be more rigorous, you can also write the array this way:
$shortCats = [
IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
// etc.
];
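To get something close to Python's unicodedata.category, the lookup can be wrapped in a small function. This is only a sketch: it assumes the intl extension is loaded, the name uni_category_intl is made up for illustration, and it uses the positional array from above, relying on the 0 to 29 ordering.

```php
<?php
// Hypothetical wrapper around IntlChar::charType(); requires the intl extension.
function uni_category_intl(string $char): ?string
{
    // Indexed in the same order as the IntlChar::CHAR_CATEGORY_* constants (0 to 29).
    static $shortCats = [
        'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
        'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
        'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
        'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
        'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf',
    ];

    // charType() returns null on failure, so guard before indexing.
    $type = IntlChar::charType($char);
    if ($type === null) {
        return null;
    }

    return $shortCats[$type] ?? null;
}

echo uni_category_intl('-'), "\n"; // Pd
echo uni_category_intl('A'), "\n"; // Lu
```

Compared with trying up to 30 regex patterns, this is a single lookup per character, at the cost of depending on the intl extension.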