Regex: match all parts with pattern up to the next occurrence

Time:09-17

Let's say I have the following log file (with line endings):

[xxx] test test[xxx]foobar
more data
[xxx] more data
[xxx] other data []:foo bar
more data here
[xxx] 1234

I would like to retrieve all parts starting with [xxx] up until the next occurrence of [xxx], so the result would become (\n indicating the newline here):

$result = [
    '[xxx] test test[xxx]foobar \n more data',
    '[xxx] more data',
    '[xxx] other data []:foo bar \n more data here',
    '[xxx] 1234'
]

I came up with the regex /(\[xxx\] .*)/g but it fails to match the cases where there are multiple lines per log entry. I've tried variations like /(\[xxx\] [\s.]*)/g but to no avail.

I feel like I'm missing something obvious here. What modifiers or other syntax should I use?

CodePudding user response:

You can use either of

preg_match_all('~\[xxx].*(?:\R(?!\[xxx]).*)*~', $text, $matches)
preg_match_all('~\[xxx].*?(?=\[xxx]|\z)~s', $text, $matches)

Or, if the left-hand [xxx] always appears at the start of a line (as in the sample log, where the [xxx] in the middle of the first line must not start a new entry):

preg_match_all('~^\[xxx].*(?:\R(?!\[xxx]).*)*~m', $text, $matches)
preg_match_all('~^\[xxx].*?(?=^\[xxx]|\z)~ms', $text, $matches)

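The first pattern in each pair is preferable because it is more efficient: its greedy .* consumes a whole line at a time and the lookahead is only tested at line breaks, whereas the lazy .*? in the second pattern tests its lookahead at every character.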

Details:

  • ^ - start of a line
  • \[xxx] - a [xxx] string
  • .* - the rest of the line
  • (?:\R(?!\[xxx]).*)* - zero or more sequences of
    • \R(?!\[xxx]) - a line break sequence not immediately followed by [xxx]
    • .* - the rest of the line.
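The pieces listed above can be checked against the sample log from the question. This is only a minimal sketch: $text and $matches are the names used in the answer, and the sample is assumed to use "\n" line endings.

```php
<?php
// Sample log from the question, assuming "\n" line endings.
$text = "[xxx] test test[xxx]foobar\nmore data\n[xxx] more data\n"
      . "[xxx] other data []:foo bar\nmore data here\n[xxx] 1234";

// Anchored, line-by-line pattern: a line starting with [xxx], plus any
// following lines that do not themselves start with [xxx].
preg_match_all('~^\[xxx].*(?:\R(?!\[xxx]).*)*~m', $text, $matches);

// $matches[0] holds the full matches, one per log entry.
foreach ($matches[0] as $entry) {
    // Show inner line breaks as a literal \n for readability.
    echo str_replace("\n", '\n', $entry), PHP_EOL;
}
```

The matches end at the last character of an entry's final line, so no trailing line break needs to be trimmed.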

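With the m and s flags, the ^\[xxx].*?(?=^\[xxx]|\z) regex matches [xxx] at the start of a line, then zero or more characters (line breaks included), as few as possible, up to a position that is either immediately followed by [xxx] at the start of a line or is the very end of the string. Every match except the last therefore ends with the line break that precedes the next [xxx].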
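To see how the lazy variant behaves on the sample log, here is a small sketch under the same assumptions ("\n" line endings). The names $entries and $unanchored are made up for illustration. It also runs the unanchored lazy pattern to show that the [xxx] in the middle of the first line ends a match early when the ^ anchors are left out.

```php
<?php
// Sample log from the question, assuming "\n" line endings.
$text = "[xxx] test test[xxx]foobar\nmore data\n[xxx] more data\n"
      . "[xxx] other data []:foo bar\nmore data here\n[xxx] 1234";

// Lazy pattern: m lets ^ match at line starts, s lets . match line breaks.
preg_match_all('~^\[xxx].*?(?=^\[xxx]|\z)~ms', $text, $matches);

// Each match runs right up to the next [xxx] line, so it still carries
// the line break before it; rtrim() strips that trailing whitespace.
$entries = array_map('rtrim', $matches[0]);

// Without the ^ anchors, the mid-line [xxx] also satisfies the lookahead.
preg_match_all('~\[xxx].*?(?=\[xxx]|\z)~s', $text, $unanchored);

echo count($entries), ' entries vs ', count($unanchored[0]), ' unanchored matches', PHP_EOL;
```

Note that rtrim() also removes trailing spaces and tabs, which may or may not be wanted for real log lines.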

CodePudding user response:

An alternative PHP solution using preg_split and preg_replace with a simple regex:

$data = '[xxx] test test[xxx]foobar
more data
[xxx] more data
[xxx] other data []:foo bar
more data here
[xxx] 1234';

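// PREG_SPLIT_NO_EMPTY skips any empty piece, e.g. the one before the first [xxx]
foreach (preg_split('/^(?=\[xxx] )/m', $data, -1, PREG_SPLIT_NO_EMPTY) as $el) {
    // print inner newlines as a literal \n; the trailing newline is kept
    echo preg_replace('/\n(?!$)/', '\\n', $el);
}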


Output:

[xxx] test test[xxx]foobar\nmore data
[xxx] more data
[xxx] other data []:foo bar\nmore data here
[xxx] 1234

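Breakdown:

  • /^(?=\[xxx] )/m: used in preg_split to split the input at every position where "[xxx] " begins a line; the lookahead keeps the match empty, so the split consumes no text
  • /\n(?!$)/: used in preg_replace to turn every newline in an element, except a trailing one, into the literal two-character text \n (written '\\n' in the PHP source)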
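If an array like the $result from the question is wanted instead of echoed text, the same split can feed it directly. This is a sketch rather than part of the original answer: $parts is a name made up here, and trimming with rtrim is an assumption about what "up until the next occurrence" should leave behind.

```php
<?php
$data = '[xxx] test test[xxx]foobar
more data
[xxx] more data
[xxx] other data []:foo bar
more data here
[xxx] 1234';

// Split before every line that starts with "[xxx] ". PREG_SPLIT_NO_EMPTY
// discards any empty piece, such as the one before the split at offset 0.
$parts = preg_split('/^(?=\[xxx] )/m', $data, -1, PREG_SPLIT_NO_EMPTY);

// Every piece except the last still ends with a line break; strip it.
$result = array_map('rtrim', $parts);

print_r($result);
```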