trying to split a comma separated string ignoring quotes and brackets

Time:12-08

I'm trying to split a text into comma separated groups, except when the comma is in double or single quotes, or in brackets.

e.g.

  1. a,b=456 should find a and b=456
  2. a='123,456',b should find a='123,456' and b
  3. a=x(1,2,3),b,c should find a=x(1,2,3) and b and c

I have tried str_getcsv and some preg_split but I can't seem to get the right pattern.

Using the following code

function test($n, $a,$b) {
    echo "Test $n";
    if ( $a===$b ) echo "=<span style='color:green'>CORRECT ************************</span>";
    else echo "=<span style='color:red'>WRONG</span>";
    echo "<PRE>".print_r($b, true)."</PRE>";
    echo "<HR>\n";
}

$t=    'lorem ipsum=123,delor=\'1,456\',sit="123,456",amet=xxx(2,3),"consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."';
$want=["lorem ipsum=123","delor='1,456'","sit=\"123,456\"","amet=xxx(2,3)","consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."];

and

echo "WANTED.<PRE style='color:green'>".print_r($want, true)."</PRE><HR>";
//Array
//(
//    [0] => lorem ipsum=123
//    [1] => delor='1,456'
//    [2] => sit="123,456"
//    [3] => amet=xxx(2,3)
//    [4] => consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
//)


test("1 explode", $want, explode(",", $t));
// Test 1 explode=WRONG
// Array
// (
//     [0] => lorem ipsum=123
//     [1] => delor='1
//     [2] => 456'
//     [3] => sit="123
//     [4] => 456"
//     [5] => amet=xxx(2
//     [6] => 3)
//     [7] => "consectetur adipiscing elit
//     [8] =>  sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )


test("2 str_getcsv", $want, str_getcsv($t, ",", "'"));
// Test 2 str_getcsv=WRONG
// Array
// (
//     [0] => lorem ipsum=123
//     [1] => delor='1
//     [2] => 456'
//     [3] => sit="123
//     [4] => 456"
//     [5] => amet=xxx(2
//     [6] => 3)
//     [7] => "consectetur adipiscing elit
//     [8] =>  sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )


test("3 str_getcsv", $want, str_getcsv($t, ",", "\""));
// Test 3 str_getcsv=WRONG
// Array
// (
//     [0] => lorem ipsum=123
//     [1] => delor='1
//     [2] => 456'
//     [3] => sit="123
//     [4] => 456"
//     [5] => amet=xxx(2
//     [6] => 3)
//     [7] => "consectetur adipiscing elit
//     [8] =>  sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )    


test("4 preg_split", $want, preg_split("/,/", $t));
// Test 4 preg_split=WRONG
// Array
// (
//     [0] => lorem ipsum=123
//     [1] => delor='1
//     [2] => 456'
//     [3] => sit="123
//     [4] => 456"
//     [5] => amet=xxx(2
//     [6] => 3)
//     [7] => "consectetur adipiscing elit
//     [8] =>  sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )

I've lost a huge amount of time searching and trying different patterns. I'm sure I could have written a string parser more quickly than this, but can someone give me a good pattern that handles these cases?

I've put a sample test on https://onlinephp.io/c/3f4d3 to run this code

Thanks

CodePudding user response:

I suggest using

preg_match_all('~(?:\'[^\']*\'|"[^"]*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~', $text, $matches)

Or, if there can be escape sequences inside the quoted substrings:

preg_match_all('~(?:\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'|"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~s', $text, $matches)

See the regex demo.
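To illustrate, here is a minimal sketch of how the simpler pattern might be used on the examples from the question. The whole matches end up in $matches[0]. The function name splitOutsideGroups is made up for this example and is not a PHP built-in.

```php
<?php
// Hypothetical helper (name invented for illustration): returns the
// comma-separated groups, treating quoted strings and balanced
// parentheses as indivisible units.
function splitOutsideGroups(string $text): array
{
    // '...' | "..." | balanced (...) via recursion into group 1 | any
    // single char that is not a quote or a comma; repeated one or more times.
    $pattern = '~(?:\'[^\']*\'|"[^"]*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~';
    preg_match_all($pattern, $text, $matches);

    // Index 0 holds the full matches; index 1 is only the recursion group.
    return $matches[0];
}

echo implode(' | ', splitOutsideGroups("a,b=456")), "\n";        // a | b=456
echo implode(' | ', splitOutsideGroups("a='123,456',b")), "\n";  // a='123,456' | b
echo implode(' | ', splitOutsideGroups("a=x(1,2,3),b,c")), "\n"; // a=x(1,2,3) | b | c
```

Note that quoted items keep their quotes. On the $t string from the question, the last item would come back still wrapped in its double quotes, so it would not be identical to the last entry of $want unless the quotes are trimmed afterwards.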

Details:

  • (?: - start of a non-capturing group (acting as a container here):
    • \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'| - a string between single quotes with escape sequences support, or
    • "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"| - a string between double quotes with escape sequences support, or
    • (\((?:[^()]++|(?1))*\))| - a string between two paired nested parentheses, or
    • [^\'",] - a char other than ', " and ,
  • )+ - end of the group, matched one or more times.
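
Since the question mentions hand-writing a string parser, here is a rough sketch of that alternative for comparison. It is not the regex approach described above: it makes a single pass over the string, tracks whether a quote is open and how deeply parentheses are nested, and splits only on commas seen outside both. The name splitTopLevel is hypothetical, and the sketch assumes there are no escaped quotes inside quoted substrings.

```php
<?php
// Hypothetical helper (name invented for illustration): splits on commas
// that are outside quotes and outside parentheses.
// Assumption: quoted substrings contain no backslash-escaped quotes.
function splitTopLevel(string $text): array
{
    $parts   = [];
    $current = '';
    $quote   = null; // the quote character currently open, or null
    $depth   = 0;    // current parenthesis nesting depth

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];

        if ($quote !== null) {
            // Inside quotes: only the matching quote character closes it.
            if ($ch === $quote) {
                $quote = null;
            }
        } elseif ($ch === "'" || $ch === '"') {
            $quote = $ch;
        } elseif ($ch === '(') {
            $depth++;
        } elseif ($ch === ')' && $depth > 0) {
            $depth--;
        } elseif ($ch === ',' && $depth === 0) {
            // Top-level comma: finish the current group and start a new one.
            $parts[] = $current;
            $current = '';
            continue;
        }

        $current .= $ch;
    }

    $parts[] = $current; // the last group has no trailing comma
    return $parts;
}

echo implode(' | ', splitTopLevel("a='123,456',b")), "\n";  // a='123,456' | b
echo implode(' | ', splitTopLevel("a=x(1,2,3),b,c")), "\n"; // a=x(1,2,3) | b | c
```

One difference in behaviour worth knowing: this scanner keeps empty groups (for example, "a,,b" gives three items), whereas the regex only returns non-empty matches.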