Regex split string on a char with exception for inner-string

Time:10-17

I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split it on the | character, with an exception for any | contained in a quoted string.

The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']

I've already found an answer to a similar question here : https://stackoverflow.com/a/11457952/11260467

However, I can't find a way to adapt it to a case with multiple separator characters. Is there someone out here who is less dumb than me when it comes to regular expressions?

CodePudding user response:

This is easier if you match the parts instead of splitting. Quantifiers are greedy by default, so they consume as many characters as possible, and alternatives are tried from left to right. This lets you define the more specific patterns for quoted strings before providing a pattern for an unquoted token:

$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';

$pattern = <<<'PATTERN'
(
    (?:[|[]|^) # after | or [ or string start
    \s*
    (?<token> # name the match
        "[^"]*" # string in double quotes
        |
        '[^']*'  # string in single quotes
        |
        [^\s|]+  # one or more characters that are neither whitespace nor |
    )
    \s*
)x
PATTERN;

preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);

Output:

array(4) {
  [0]=>
  string(2) "aa"
  [1]=>
  string(2) "bb"
  [2]=>
  string(9) ""cc | dd""
  [3]=>
  string(9) "'ee | ff'"
}

Hints:

  1. The <<<'PATTERN' is called NOWDOC syntax (a HEREDOC without interpolation) and cuts down on escaping
  2. I use () as pattern delimiters - they are group 0
  3. Naming matches makes code a lot more readable
  4. Modifier x lets you indent and comment the pattern

CodePudding user response:

This is easily done with the (*SKIP)(*FAIL) verbs that PCRE offers:

(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*

In PHP this could be:

<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';

$splitted = preg_split($pattern, $string);
print_r($splitted);
?>

And would yield

Array
(
    [0] => aa
    [1] => bb
    [2] => "cc | dd"
    [3] => 'ee | ff'
)

See a demo on regex101.com and on ideone.com.
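Since the question mentions other separator characters, here is a minimal sketch of the same (*SKIP)(*FAIL) pattern wrapped in a function with a configurable separator. The function name split_outside_quotes and its parameters are hypothetical, not part of the answer above; preg_quote() escapes the separator so it can be dropped into the pattern safely.

```php
<?php

// Hypothetical helper (an assumption, not from the answer above): split
// $input on $separator, ignoring separators inside quoted substrings.
function split_outside_quotes(string $input, string $separator = '|'): array
{
    // Escape regex metacharacters in the separator, and the "~" delimiter.
    $sep = preg_quote($separator, '~');

    // Quoted substrings are matched first and then skipped; only
    // separators outside of quotes are left as split points.
    $pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*' . $sep . '\s*~';

    $parts = preg_split($pattern, $input);

    // preg_split() returns false on failure; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

print_r(split_outside_quotes("aa | bb | \"cc | dd\" | 'ee | ff'"));
print_r(split_outside_quotes("aa ; 'bb ; cc' ; dd", ';'));
```

The first call uses the default | separator on the string from the question; the second shows the same idea with ; as the separator.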

CodePudding user response:

Use

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));

See PHP proof.

Results:

Array
(
    [0] => aa
    [1] => bb
    [2] => cc | dd
    [3] => ee | ff
)

EXPLANATION

--------------------------------------------------------------------------------
  (?|                      Branch reset group, does not capture:
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^\"]*                   any character except: '\"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^|'\"]+                 any character except: '|', ''', '\"'
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \|                       '|'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of grouping

CodePudding user response:

It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.

(['"]).*?\1\K| *\| *

PCRE Demo

(['"]) # match a single or double quote and save to capture group 1
.*?    # match zero or more characters lazily
\1     # match the content of capture group 1
\K     # reset the starting point of the reported match and discard
       # any previously-consumed characters from the reported match
|      # or
\ *    # match zero or more spaces
\|     # match a pipe character
\ *    # match zero or more spaces

Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.
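In PHP, the \K pattern could be used with preg_split() roughly as follows. This is a sketch, not code from the answer: the variable names are illustrative, and the PREG_SPLIT_NO_EMPTY flag is my addition, on the assumption that the zero-length match left behind by \K can otherwise produce empty pieces.

```php
<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

// Same pattern as above, wrapped in "~" delimiters.
$pattern = '~([\'"]).*?\1\K| *\| *~';

// After a quoted substring, \K leaves a zero-length match just past the
// closing quote. PREG_SPLIT_NO_EMPTY drops any empty pieces that result.
$parts = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

With the sample string this is intended to give the same four pieces as the (*SKIP)(*FAIL) version, with the quotes kept.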
