Punctuation somehow breaks preg_match_all group capture

Time:09-16

Consider this function

function Split_Sentence($string, $asalpha)
{
 preg_match_all("~(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>\S+)~ui", $string, $out);

 $res = []; // collected matches, keyed by match index

 foreach($out as $group_key=>$group)
 {
   if(!is_numeric($group_key))
   {  
    // discard indexed groups 
    foreach($group as $i=>$v)
    { 
     if(mb_strlen($v))
     {   
      $res[$i]=['type'=>$group_key,'text'=>$v];
     }
    }
   }
  }
  
  ksort($res);
  return $res;
}

(where $asalpha is a series of characters to be matched as "alpha" no matter what)

This function is used to parse a sentence and break it into groups of Han, Alphabetic, or "Other" characters.

Punctuation seems to break it, and I can't figure out why. When punctuation is involved, the whole block starting with the punctuation mark gets matched as "other".

For instance "hello 中国朋友 你好and welcome" correctly returns

Array (
    [0] => Array
        (
            [type] => alpha
            [text] => hello
        )

    [1] => Array
        (
            [type] => han
            [text] => 中国朋友
        )

    [2] => Array
        (
            [type] => han
            [text] => 你好
        )

    [3] => Array
        (
            [type] => alpha
            [text] => and
        )

    [4] => Array
        (
            [type] => alpha
            [text] => welcome
        )

)

But "hello 中国朋友,你好and welcome" returns

Array
(
    [0] => Array
        (
            [type] => alpha
            [text] => hello
        )

    [1] => Array
        (
            [type] => han
            [text] => 中国朋友
        )

    [2] => Array
        (
            [type] => other
            [text] => ,你好and
        )

    [3] => Array
        (
            [type] => alpha
            [text] => welcome
        )

)

What am I missing?

Update: the problem seems to be with the "other" group using \S+ rather than \S. While \S partially fixes the problem, each "other" character is then captured individually. \S+, on the other hand, captures multiple "other" characters as a group, but it also swallows Han and alpha characters until it finds a space.

CodePudding user response:

The comma is matched by \S+ because \S matches any character except whitespace, and \S+ matches one or more such characters. Once the match starts at the comma, it also consumes every character that \p{Han}+ or (?<alpha>[a-z\d$asalpha]+) could have matched, up to the next whitespace.
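To see this behaviour in isolation, here is a minimal sketch that runs the question's pattern on the failing input. It assumes an empty $asalpha (so the extra characters are simply left out of the class); $pattern and $m are names chosen for this demo only.

```php
<?php
// The question's pattern, assuming $asalpha is empty for this demo.
$pattern = '~(?<han>\p{Han}+)|(?<alpha>[a-z\d]+)|(?<other>\S+)~ui';

preg_match_all($pattern, 'hello 中国朋友,你好and welcome', $m);

// $m[0] holds the full matches in order. At the comma, "han" and "alpha"
// both fail, so "other" starts and \S+ runs on until the next whitespace.
print_r($m[0]);
```

The third full match is ",你好and": the Han and Latin characters after the comma never get a chance to be tried against the first two alternatives.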

If you want to exclude what \p{Han} and [a-z\d$asalpha] match from \S, use

(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>[^\p{Han}a-z\d$asalpha\s]+)

[^\p{Han}a-z\d$asalpha\s]+ matches one or more characters other than Chinese characters, ASCII letters (lowercase only, unless the i flag is set as it is here), digits, the additional $asalpha characters and whitespace.
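Put together, the function from the question could look roughly like the sketch below. The name split_sentence_fixed is hypothetical, and the preg_quote() call is an added assumption (that $asalpha contains literal characters rather than regex syntax); neither comes from the original post.

```php
<?php
// Hypothetical variant of the question's function using the pattern above.
function split_sentence_fixed(string $string, string $asalpha = ''): array
{
    // Assumption: $asalpha is a list of literal characters, so escape it
    // before placing it inside the character classes.
    $extra = preg_quote($asalpha, '~');
    $pattern = "~(?<han>\\p{Han}+)|(?<alpha>[a-z\\d$extra]+)"
             . "|(?<other>[^\\p{Han}a-z\\d$extra\\s]+)~ui";

    preg_match_all($pattern, $string, $out);

    $res = [];
    foreach ($out as $group_key => $group) {
        if (is_int($group_key)) {
            continue; // discard indexed groups, keep only named ones
        }
        foreach ($group as $i => $v) {
            if ($v !== '') {
                $res[$i] = ['type' => $group_key, 'text' => $v];
            }
        }
    }

    ksort($res); // restore the order in which the matches occurred
    return $res;
}

$parts = split_sentence_fixed('hello 中国朋友,你好and welcome');
foreach ($parts as $p) {
    echo $p['type'], ': ', $p['text'], PHP_EOL;
}
```

For the failing input this yields six entries: alpha "hello", han "中国朋友", other ",", han "你好", alpha "and", alpha "welcome". The comma is now a group of its own and no longer drags the following Han and Latin characters along.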
