Regex capture multi-line groups

Time:08-29

I'm struggling to create a regex that captures what lies between two keywords in a multi-line file.

In particular, consider the following file:

#%META
# date: 2022-08-27
# generated-by: Me
# id: 1
#%ENDS

#%BODY
....
#%ENDS

#%META
# date: 2022-08-27
# generated-by: Another Me
# id: 2
#%ENDS

#%BODY
....
#%ENDS

I want to extract what is between the #%META and #%ENDS keywords, if possible without the leading #. That is, the desired result is to capture both:

date: 2022-08-27
generated-by: Me
id: 1

and

date: 2022-08-27
generated-by: Another Me
id: 2

I came up with the following regex: (?<=#%META\n)([\S\s]*?)(?=#%ENDS\n).

However, it neither identifies the two chunks of text to be matched nor removes the leading #.

Could anyone help with that?

Thanks a lot! :)

CodePudding user response:

You could first capture everything between #%META and #%ENDS with a pattern, and then post-process the capture group 1 values to remove the leading # followed by optional spaces.

^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$

Explanation

  • ^ Start of a line (with the m flag)
  • #%META Match literally
  • ( Capture group 1
    • (?> Atomic group
      • \R Match any Unicode newline sequence
      • (?!#%(?:META|ENDS)$) Negative lookahead, assert that the line is not #%META or #%ENDS
      • .* Match the whole line
    • )+ Close the atomic group and repeat it 1 or more times
  • ) Close group 1
  • \R Match any Unicode newline sequence
  • #%ENDS Match literally
  • $ End of a line (with the m flag)
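
As a rough illustration, the same pattern can be written in extended mode (the x flag) so that each annotation sits next to the part it describes. This is only a sketch: the variable names and the shortened sample input are made up for illustration, and in x mode every literal # has to be escaped because an unescaped # starts a comment.

```php
<?php
// Annotated form of the pattern using the x (extended) and m (multiline) flags.
// In x mode whitespace is ignored and # starts a comment, so a literal # is written as \#.
$re = '/
    ^\#%META                    # opening keyword at the start of a line
    (                           # capture group 1
      (?>                       # atomic group
        \R                      # any Unicode newline sequence
        (?!\#%(?:META|ENDS)$)   # the following line is not a keyword line
        .*                      # the rest of that line
      )+                        # one or more such lines
    )
    \R\#%ENDS$                  # closing keyword on its own line
/mx';

// Shortened, hypothetical input with two META blocks and one BODY block.
$sample = "#%META\n# id: 1\n#%ENDS\n\n#%BODY\n....\n#%ENDS\n\n#%META\n# id: 2\n#%ENDS";

preg_match_all($re, $sample, $matches);
var_export($matches[1]);
```

Group 1 still contains the leading newline and the "# " prefix of every line, which is why a second clean-up step is needed.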

Example

$re = '/^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$/m';
$str = '#%META
# date: 2022-08-27
# generated-by: Me
# id: 1
#%ENDS

#%BODY
....
#%ENDS

#%META
# date: 2022-08-27
# generated-by: Another Me
# id: 2
#%ENDS

#%BODY
....
#%ENDS';

if (preg_match_all($re, $str, $matches)) {
    $result = array_map(function ($s) {
        return preg_replace("/^#\h*/m", "", trim($s));
    }, $matches[1]);
    var_export($result);
}

Output

array (
  0 => 'date: 2022-08-27
generated-by: Me
id: 1',
  1 => 'date: 2022-08-27
generated-by: Another Me
id: 2',
)
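
If the fields are needed individually rather than as one string per block, each cleaned block could be split further into key/value pairs. The following is a minimal sketch that assumes every line has the form "key: value"; the helper name parseMetaBlock is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: turn one cleaned block ("key: value" per line) into an array.
function parseMetaBlock(string $block): array
{
    $fields = [];
    foreach (preg_split('/\R/', $block) as $line) {
        // Split on the first colon only, so values may themselves contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $fields[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $fields;
}

// One block in the shape produced by the clean-up step.
$block = "date: 2022-08-27\ngenerated-by: Another Me\nid: 2";
$meta = parseMetaBlock($block);
var_export($meta);
```

Lines without a colon are skipped silently in this sketch; stricter handling may be preferable for real input.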

CodePudding user response:

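Your pattern itself works; what is missing is a call that returns all matches. The /m modifier only changes how ^ and $ behave, so it does not help here. Use preg_match_all instead of preg_match, and strip the leading "# " beforehand.
Try this:

    $str = preg_replace_callback(
        '/^# (.+)$/m',
        static function ($m) {
            return $m[1];
        },
        $str,
    ); // or just str_replace('# ', '', $str)
    preg_match_all('/(?<=#%META\n)([\S\s]*?)(?=#%ENDS\n)/', $str, $m);
    var_dump($m[1]);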
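
Whether one block or all blocks come back depends on the function that is called: preg_match stops at the first match, while preg_match_all keeps scanning. A minimal sketch of the difference, using the lookaround pattern from the question on a shortened input (the sample string and variable names are made up for illustration):

```php
<?php
// Shortened, hypothetical input in which the leading "# " has already been removed.
$sample = "#%META\nid: 1\n#%ENDS\n\n#%META\nid: 2\n#%ENDS\n";
$re = '/(?<=#%META\n)([\S\s]*?)(?=#%ENDS\n)/';

// preg_match stops after the first match.
$single = preg_match($re, $sample, $first);

// preg_match_all continues and collects every block.
$count = preg_match_all($re, $sample, $all);

var_dump($single, $first[1], $count, $all[1]);
```

The pattern has no ^ or $ anchors, so adding or removing the m flag does not change either result.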