Consider this function:
function Split_Sentence($string, $asalpha)
{
    // Each named group captures a run of one or more characters of one type
    preg_match_all("~(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>\S+)~ui", $string, $out);
    $res = [];
    foreach ($out as $group_key => $group)
    {
        if (!is_numeric($group_key))
        {
            // discard indexed groups
            foreach ($group as $i => $v)
            {
                if (mb_strlen($v))
                {
                    $res[$i] = ['type' => $group_key, 'text' => $v];
                }
            }
        }
    }
    ksort($res);
    return $res;
}
(where $asalpha is a series of characters to be matched as "alpha" no matter what)
This function is used to parse a sentence and break it into groups of Han, Alphabetic, or "Other" characters.
Punctuation seems to break it, and I can't figure out why. If punctuation is involved, the whole block starting with the punctuation sign gets matched as "other".
For instance "hello 中国朋友 你好and welcome" correctly returns
Array (
[0] => Array
(
[type] => alpha
[text] => hello
)
[1] => Array
(
[type] => han
[text] => 中国朋友
)
[2] => Array
(
[type] => han
[text] => 你好
)
[3] => Array
(
[type] => alpha
[text] => and
)
[4] => Array
(
[type] => alpha
[text] => welcome
)
)
But "hello 中国朋友,你好and welcome" returns
Array
(
[0] => Array
(
[type] => alpha
[text] => hello
)
[1] => Array
(
[type] => han
[text] => 中国朋友
)
[2] => Array
(
[type] => other
[text] => ,你好and
)
[3] => Array
(
[type] => alpha
[text] => welcome
)
)
What am I missing?
Update: the problem seems to be with the "other" group using \S+ rather than \S. While \S partially fixes the problem, each "other" character is then captured on its own. \S+, on the other hand, captures multiple "other" characters as a group, but it also swallows Han and alpha characters until it finds a space.
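The difference between the two quantifiers can be reproduced in isolation. This is only a minimal sketch: the helper name tokens_with_other and the sample string are made up for illustration, and the extra $asalpha characters are left out.

```php
<?php
// Hypothetical helper: returns every token matched when the given
// sub-pattern is used as the third ("other") alternative.
function tokens_with_other(string $other, string $text): array
{
    // Same alternation order as in Split_Sentence: Han, then alpha, then other
    preg_match_all("~\p{Han}+|[a-z\d]+|$other~ui", $text, $m);
    return $m[0];
}

// \S takes one character at a time, so "?" and "!" become separate tokens
print_r(tokens_with_other('\S', '你好?!and'));

// \S+ starts at "?" and keeps going until whitespace, swallowing "and"
print_r(tokens_with_other('\S+', '你好?!and'));
```

The first call yields 你好, ?, !, and; the second yields 你好 followed by the single token ?!and.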
CodePudding user response:
The comma is matched with \S+ because \S matches any char but whitespace, and the \S+ pattern matches one or more occurrences of non-whitespace chars. Once it started at the comma, it consumed all the chars that \p{Han}+ could have matched. It will also consume all chars that (?<alpha>[a-z\d$asalpha]+) can match.
If you want to exclude \p{Han} and [a-z\d$asalpha] from \S, use
(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>[^\p{Han}a-z\d$asalpha\s]+)
[^\p{Han}a-z\d$asalpha\s]+ matches one or more chars other than Chinese chars, ASCII letters (either case, because of the i flag), digits, additional $asalpha chars and whitespace chars.
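For reference, here is a sketch of the question's function with only the "other" group swapped for the negated character class. It assumes, as in the original, that $asalpha is interpolated into the pattern unescaped, so it must not contain characters that are special inside a character class; the $pattern variable is introduced here just for readability.

```php
<?php
function Split_Sentence($string, $asalpha)
{
    // "other" no longer overlaps with the han/alpha groups or with whitespace
    $pattern = "~(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>[^\p{Han}a-z\d$asalpha\s]+)~ui";
    preg_match_all($pattern, $string, $out);

    $res = [];
    foreach ($out as $group_key => $group) {
        // keep only the named groups; numeric keys duplicate them
        if (!is_numeric($group_key)) {
            foreach ($group as $i => $v) {
                // groups that did not take part in match $i are empty strings
                if (mb_strlen($v)) {
                    $res[$i] = ['type' => $group_key, 'text' => $v];
                }
            }
        }
    }
    ksort($res);
    return $res;
}

foreach (Split_Sentence("hello 中国朋友,你好and welcome", "") as $token) {
    echo $token['type'], ': ', $token['text'], PHP_EOL;
}
```

With an empty $asalpha, the sample sentence is split into six tokens: hello (alpha), 中国朋友 (han), the comma (other), 你好 (han), and (alpha), welcome (alpha).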