How to find unicode character class in PHP

Time:02-13

I am having a hard time finding a way to get the Unicode class (general category) of a character.

List of Unicode classes: https://www.php.net/manual/en/regexp.reference.unicode.php

The desired function in Python: https://docs.python.org/3/library/unicodedata.html#unicodedata.category

I just want the PHP equivalent of this Python function.

For example, if I called a function x like this: x('-'), it would return Pd, because Pd is the class the hyphen belongs to.

Thanks.

CodePudding user response:

Apparently there is no built-in function that does that, so I wrote this one:

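<?php
// All 30 two-letter Unicode general categories.
// (The letter and number categories must be listed too, otherwise
// characters such as 'a' or '1' would never match.)
$UNICODE_CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp",
    "Cc", "Cf", "Cs", "Co", "Cn"
];

function uni_category($char, $UNICODE_CATEGORIES) {
    foreach ($UNICODE_CATEGORIES as $category) {
        // The u modifier makes PCRE treat the subject as UTF-8;
        // ^ restricts the test to the first character of $char.
        if (preg_match('/^\p{'.$category.'}/u', $char))
            return $category;
    }
    return null; // empty string or invalid UTF-8
}
// call the function
print uni_category('-', $UNICODE_CATEGORIES); // it returns Pd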

This code works for me; I hope it helps somebody in the future :).

CodePudding user response:

One option is IntlChar::charType (from the intl extension). This method returns only an int, but that int corresponds to one of the category constants defined in the IntlChar class. The constants for the 30 categories cover the range 0 to 29 with no gaps, so all you have to do is build an indexed array that follows the same order:

$shortCats = [
    'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
    'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
    'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
    'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
    'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];

echo $shortCats[IntlChar::charType('-')]; //Pd
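To classify every character of a string rather than a single one, the same lookup can be wrapped in a small helper. This is only a sketch: it assumes the intl and mbstring extensions are loaded (mb_str_split needs PHP 7.4 or later), and the name categories_of is hypothetical, not something from the answer above.

```php
<?php
// Index = value returned by IntlChar::charType(), same order as the array above.
$shortCats = [
    'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
    'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
    'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
    'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
    'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];

// Hypothetical helper: returns a list of [character, category] pairs
// for a UTF-8 string.
function categories_of(string $text, array $shortCats): array {
    $result = [];
    // Split into code points, not bytes.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // charType() returns null when it cannot interpret its argument.
        $type = IntlChar::charType($char);
        $result[] = [$char, $type === null ? null : $shortCats[$type]];
    }
    return $result;
}

// "\u{E9}" is e-acute, "\u{20AC}" is the euro sign.
foreach (categories_of("A-\u{E9}\u{20AC}", $shortCats) as [$char, $cat]) {
    echo $char, ' => ', $cat, PHP_EOL;
}
```

Passing the array as a parameter mirrors the first answer's function; it could equally be a constant or a static property.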

Note: if you are worried that the numeric values defined in the class might change in the future and want to be more rigorous, you can also key the array by the constants:

$shortCats = [
    IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
    IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
    IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
    IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
    // etc.
];
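For reference, a complete version of the constant-keyed array might look like the sketch below. The constant names are the IntlChar::CHAR_CATEGORY_* constants from the intl extension; the wrapper function unicode_category is a hypothetical name chosen to mirror Python's unicodedata.category.

```php
<?php
// Hypothetical wrapper around IntlChar::charType(); requires the intl extension.
function unicode_category(string $char): ?string {
    // Keyed by constant, so the mapping does not depend on the numeric values.
    $map = [
        IntlChar::CHAR_CATEGORY_UNASSIGNED             => 'Cn',
        IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER       => 'Lu',
        IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER       => 'Ll',
        IntlChar::CHAR_CATEGORY_TITLECASE_LETTER       => 'Lt',
        IntlChar::CHAR_CATEGORY_MODIFIER_LETTER        => 'Lm',
        IntlChar::CHAR_CATEGORY_OTHER_LETTER           => 'Lo',
        IntlChar::CHAR_CATEGORY_NON_SPACING_MARK       => 'Mn',
        IntlChar::CHAR_CATEGORY_ENCLOSING_MARK         => 'Me',
        IntlChar::CHAR_CATEGORY_COMBINING_SPACING_MARK => 'Mc',
        IntlChar::CHAR_CATEGORY_DECIMAL_DIGIT_NUMBER   => 'Nd',
        IntlChar::CHAR_CATEGORY_LETTER_NUMBER          => 'Nl',
        IntlChar::CHAR_CATEGORY_OTHER_NUMBER           => 'No',
        IntlChar::CHAR_CATEGORY_SPACE_SEPARATOR        => 'Zs',
        IntlChar::CHAR_CATEGORY_LINE_SEPARATOR         => 'Zl',
        IntlChar::CHAR_CATEGORY_PARAGRAPH_SEPARATOR    => 'Zp',
        IntlChar::CHAR_CATEGORY_CONTROL_CHAR           => 'Cc',
        IntlChar::CHAR_CATEGORY_FORMAT_CHAR            => 'Cf',
        IntlChar::CHAR_CATEGORY_PRIVATE_USE_CHAR       => 'Co',
        IntlChar::CHAR_CATEGORY_SURROGATE              => 'Cs',
        IntlChar::CHAR_CATEGORY_DASH_PUNCTUATION       => 'Pd',
        IntlChar::CHAR_CATEGORY_START_PUNCTUATION      => 'Ps',
        IntlChar::CHAR_CATEGORY_END_PUNCTUATION        => 'Pe',
        IntlChar::CHAR_CATEGORY_CONNECTOR_PUNCTUATION  => 'Pc',
        IntlChar::CHAR_CATEGORY_OTHER_PUNCTUATION      => 'Po',
        IntlChar::CHAR_CATEGORY_MATH_SYMBOL            => 'Sm',
        IntlChar::CHAR_CATEGORY_CURRENCY_SYMBOL        => 'Sc',
        IntlChar::CHAR_CATEGORY_MODIFIER_SYMBOL        => 'Sk',
        IntlChar::CHAR_CATEGORY_OTHER_SYMBOL           => 'So',
        IntlChar::CHAR_CATEGORY_INITIAL_PUNCTUATION    => 'Pi',
        IntlChar::CHAR_CATEGORY_FINAL_PUNCTUATION      => 'Pf',
    ];

    // charType() accepts a single UTF-8 character or an int code point
    // and returns null when the argument cannot be interpreted.
    $type = IntlChar::charType($char);
    return $type === null ? null : ($map[$type] ?? null);
}

echo unicode_category('-'), PHP_EOL; // Pd
```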