I'm trying to split a text into comma-separated groups, except when the comma is inside double or single quotes, or inside brackets.
e.g.
a,b=456 should find a and b=456
a='123,456',b should find a='123,456' and b
a=x(1,2,3),b,c should find a=x(1,2,3), b and c
I have tried str_getcsv and some preg_split patterns, but I can't seem to get the right pattern.
Using the following code:
function test($n, $a, $b) {
    echo "Test $n";
    if ($a === $b) {
        echo "=<span style='color:green'>CORRECT ************************</span>";
    } else {
        echo "=<span style='color:red'>WRONG</span>";
    }
    echo "<PRE>" . print_r($b, true) . "</PRE>";
    echo "<HR>\n";
}
$t= 'lorem ipsum=123,delor=\'1,456\',sit="123,456",amet=xxx(2,3),"consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."';
$want=["lorem ipsum=123","delor='1,456'","sit=\"123,456\"","amet=xxx(2,3)","consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."];
and these calls (the output of each is shown in the comments that follow it):
echo "WANTED.<PRE style='color:green'>".print_r($want, true)."</PRE><HR>";
//Array
//(
// [0] => lorem ipsum=123
// [1] => delor='1,456'
// [2] => sit="123,456"
// [3] => amet=xxx(2,3)
// [4] => consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
//)
test("1 explode", $want, explode(",", $t));
// Test 1 explode=WRONG
// Array
// (
// [0] => lorem ipsum=123
// [1] => delor='1
// [2] => 456'
// [3] => sit="123
// [4] => 456"
// [5] => amet=xxx(2
// [6] => 3)
// [7] => "consectetur adipiscing elit
// [8] => sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )
test("2 str_getcsv", $want, str_getcsv($t, ",", "'"));
// Test 2 str_getcsv=WRONG
// Array
// (
// [0] => lorem ipsum=123
// [1] => delor='1
// [2] => 456'
// [3] => sit="123
// [4] => 456"
// [5] => amet=xxx(2
// [6] => 3)
// [7] => "consectetur adipiscing elit
// [8] => sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )
test("3 str_getcsv", $want, str_getcsv($t, ",", "\""));
// Test 3 str_getcsv=WRONG
// Array
// (
// [0] => lorem ipsum=123
// [1] => delor='1
// [2] => 456'
// [3] => sit="123
// [4] => 456"
// [5] => amet=xxx(2
// [6] => 3)
// [7] => "consectetur adipiscing elit
// [8] => sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )
test("4 preg_split", $want, preg_split("/,/", $t));
// Test 4 preg_split=WRONG
// Array
// (
// [0] => lorem ipsum=123
// [1] => delor='1
// [2] => 456'
// [3] => sit="123
// [4] => 456"
// [5] => amet=xxx(2
// [6] => 3)
// [7] => "consectetur adipiscing elit
// [8] => sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
// )
I've lost a huge amount of time searching and trying different patterns, and I'm sure I could have written a string parser more quickly than this. Can someone give me a good pattern for this?
I've put a sample test on https://onlinephp.io/c/3f4d3 to run this code.
Thanks
CodePudding user response:
I suggest using
preg_match_all('~(?:\'[^\']*\'|"[^"]*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~', $text, $matches)
Or, if there can be escape sequences inside the quoted substrings:
preg_match_all('~(?:\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'|"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~s', $text, $matches)
See the regex demo.
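As a rough illustration, here is how the first pattern could be applied to the sample string from the question. The final array_map step is an assumption on my part, not something the pattern does: it strips the surrounding double quotes from a field that consists of nothing but one double-quoted string, because the question's $want array shows the last item without its quotes.

```php
<?php
// Sample input copied from the question.
$t = 'lorem ipsum=123,delor=\'1,456\',sit="123,456",amet=xxx(2,3),"consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."';

// First pattern from this answer (no escape-sequence support).
$pattern = '~(?:\'[^\']*\'|"[^"]*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~';

preg_match_all($pattern, $t, $matches);

// $matches[0] holds the full matches; $matches[1] only serves the (?1) recursion.
$fields = $matches[0];

// Assumption: a field that is entirely one double-quoted string loses its quotes,
// as in the question's $want array. Fields such as sit="123,456" are left alone.
$fields = array_map(
    static function (string $f): string {
        return preg_match('~^"[^"]*"$~', $f) ? substr($f, 1, -1) : $f;
    },
    $fields
);

print_r($fields);
```

With the unwrapping step in place, $fields should compare equal to $want from the question. Note that, because the pattern matches fields rather than separators, an empty field between two adjacent commas produces no match at all.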
Details:
(?: - start of a non-capturing group (acting as a container here):
\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'| - a string between single quotes with escape sequences support, or
"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"| - a string between double quotes with escape sequences support, or
(\((?:[^()]++|(?1))*\))| - a string between two paired nested parentheses, or
[^\'",] - a char other than ', " and ,
)+ - end of the group, one or more sequences.
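To see what the escape-aware variant buys you, here is a small sketch run against a made-up input (the $text value is hypothetical, not from the question) that contains backslash-escaped quotes and nested parentheses. The input uses a nowdoc so the backslashes appear exactly as written.

```php
<?php
// Hypothetical input: escaped quotes inside quoted values, plus nested parentheses.
$text = <<<'TXT'
a='it\'s, ok',b="say \"hi, there\"",c=f(1,g(2,3)),d
TXT;

// Escape-aware pattern from this answer. In a single-quoted PHP string,
// \\\\ becomes \\ in the regex, which matches one literal backslash.
$pattern = '~(?:\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'|"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|(\((?:[^()]++|(?1))*\))|[^\'",])+~s';

preg_match_all($pattern, $text, $matches);

// Each full match is one field; commas inside quotes or balanced
// parentheses stay inside their field.
print_r($matches[0]);
```

This should yield four fields: a='it\'s, ok', b="say \"hi, there\"", c=f(1,g(2,3)) and d. The simpler pattern would instead treat the escaped quote in the first field as the closing quote and cut the field at the next comma.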