Regular Expression for Japanese Full-Width Numbers Returning All Full Width Characters

Time:11-14

I am writing a PHP file that takes the contents of a web page, filters for full-width numbers, and converts them to half-width. Currently, my program returns all full-width characters on the page, not just the numbers.

<?php
$fullwidthPattern = '/([０-９])/';

$handle = curl_init();
 
$url = (URL removed for privacy reasons);

function getFullWidth(string $input) {
    global $fullwidthPattern;
    return preg_match($fullwidthPattern, $input);
}

curl_setopt($handle, CURLOPT_URL, $url);

curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
 
$output = curl_exec($handle);
 
curl_close($handle);

function jp_str_split($str) {
    $pattern = '/(?<!^)(?!$)/u';
    return preg_split($pattern,$str);
}

$jpContents = jp_str_split($output);

$numbers = array_filter($jpContents, 'getFullWidth');

foreach($numbers as $x) {
    echo $x;
}

My regular expression is currently '/([０-９])/', but I have also tried '/[０-９]/' and '/[０１２３４５６７８９]/'.

Answer:

Splitting should be done with

function jp_str_split($str) {
    preg_match_all('/\X/u', $str, $matches);
    return $matches[0]; 
}

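The \X construct matches a complete Unicode extended grapheme cluster, so a base character stays together with any combining marks that follow it. Your (?<!^)(?!$) pattern instead matches at every position inside the string. With the u flag those positions are code point boundaries, so the split does not fall between bytes, but it can still separate a base character from its combining mark.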
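As a rough illustration of the \X-based split, the sketch below runs jp_str_split on two sample strings. Both inputs are made up for this example and are not taken from the page in the question; the second one is "か" followed by the combining voiced sound mark U+3099.

```php
<?php
// Split a UTF-8 string into grapheme clusters, as in the function above.
function jp_str_split(string $str): array {
    preg_match_all('/\X/u', $str, $matches);
    return $matches[0];
}

// Hypothetical sample: kanji, kana and full-width digits mixed together.
$parts = jp_str_split('価格は１２３円');
echo count($parts), "\n";        // one element per character
echo implode('|', $parts), "\n"; // every element is a complete character

// A base character plus a combining mark is kept as a single element.
$combined = jp_str_split("\u{304B}\u{3099}");
echo count($combined), "\n";
```

The "\u{...}" escape needs PHP 7.0 or later, and the source file is assumed to be saved as UTF-8.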

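Also, since you are matching Unicode digits, you must pass the u flag in the second regex as well. Without it, PCRE reads the pattern and the subject byte by byte, so a range between two multibyte characters inside a character class turns into a range of bytes, and those bytes occur in most full-width characters: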

$fullwidthPattern = '/([０-９])/u';
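
To see the effect of the u flag, here is a minimal sketch that filters a small hand-written list of characters with and without it, then converts the matches to half-width. It assumes the pattern contains the full-width digits ０ to ９ and that the file is saved as UTF-8. The sample list, the variable names and the $toHalfWidth lookup table are illustrative and are not part of the original code.

```php
<?php
// Hypothetical input, standing in for the result of jp_str_split().
$chars = ['価', '１', 'Ａ', '９', 'x'];

// Without the u flag the class is read as bytes, so other multibyte
// characters pass the filter as well.
$withoutU = array_filter($chars, function ($c) {
    return preg_match('/[０-９]/', $c) === 1;
});

// With the u flag the class is a range of code points: only ０ to ９ match.
$withU = array_filter($chars, function ($c) {
    return preg_match('/[０-９]/u', $c) === 1;
});

// Illustrative lookup table for the full-width to half-width conversion.
$toHalfWidth = [
    '０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4',
    '５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
];
$converted = strtr(implode('', $withU), $toHalfWidth);

echo implode('', $withoutU), "\n"; // more than just the digits
echo implode('', $withU), "\n";    // the full-width digits only
echo $converted, "\n";             // the same digits in half-width form
```

If the mbstring extension is available, mb_convert_kana($str, 'n', 'UTF-8') is another way to do the conversion step.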