PHP PCRE regular expressions without delimiter

Time:10-01

Background / Intro

Generally, regular expressions in PHP, e.g. for preg_match(), begin and end with a delimiter character such as /. Personally, I often use @ instead.

Instead of a single delimiter character, one can also use a pair of opening and closing brackets such as (), {} or [].

The delimiter character needs to be escaped when it should be interpreted as a regular character. E.g. preg_match('@^\w+\@\w+\.\w+$@', $mail) needs to escape the '@' as '\@'.

The function preg_quote(string $str, ?string $delimiter) accepts null for the $delimiter, which suggests that regular expressions can be written in a way where we don't have to worry about the delimiter.
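As a quick illustration of the $delimiter argument, the sketch below quotes the same string with and without a delimiter. The sample string is made up for the example; the results in the comments follow from preg_quote() escaping the regex special characters and, in addition, the delimiter if one is given.

```php
<?php
// Sample input, chosen for illustration only.
$str = 'a/b.c@d';

// No delimiter given: only regex special characters are escaped.
$plain = preg_quote($str);        // 'a/b\.c@d'

// Delimiter '/': the slash is escaped as well.
$slash = preg_quote($str, '/');   // 'a\/b\.c@d'

// Delimiter '@': the at sign is escaped instead of the slash.
$at = preg_quote($str, '@');      // 'a/b\.c\@d'
```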

With () it seems we don't have to worry about the delimiter, because '(' and ')' already need to be escaped anyway.

With [] and {} it is a bit different. While an orphan '[' causes an error, an orphan ']', '{' or '}' does not.
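The difference can be checked directly. The sketch below uses '/' as the delimiter, so that only the pattern syntax itself is being tested. The helper name compiles() is invented for this example, and it relies on preg_match() returning false when a pattern fails to compile.

```php
<?php
// Hypothetical helper: true if PCRE accepts the pattern body.
function compiles(string $body): bool {
    // '@' suppresses the compile warning; false signals an error.
    return @preg_match('/' . $body . '/', '') !== false;
}

$results = [
    '[' => compiles('['),  // false: unterminated character class
    ']' => compiles(']'),  // true: taken as a literal ']'
    '{' => compiles('{'),  // true: taken as a literal '{'
    '}' => compiles('}'),  // true: taken as a literal '}'
];
```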

Motivation

For package development, I would like to provide methods where the user can specify regex fragments, without worrying about the choice of delimiter.

E.g. if I used '/' internally as the delimiter, then the user (calling code) would need to escape '/' in the provided regex fragment. If I use '@', they can leave '/' unescaped, but need to escape the '@'. If I use null / '()', they would not need to escape anything, I think.

Here is an imaginary example. Please don't ask what ->setFragment() does, all you need to know is that the second parameter receives a regex fragment (that is, a snippet that can be inserted into a regex).

// If regex like '/../':
$system->setFragment('email', '\w+@\w+\.\w+');  // nothing escaped.
$system->setFragment('dir', '\w+(\/\w+)*');  // '/' escaped.

// If regex like '@..@':
$system->setFragment('email', '\w+\@\w+\.\w+');  // '@' escaped.
$system->setFragment('dir', '\w+(/\w+)*');  // nothing escaped.

// If regex like '(..)':
$system->setFragment('email', '\w+@\w+\.\w+');  // nothing escaped.
$system->setFragment('dir', '\w+(/\w+)*');  // nothing escaped.

Another example, more akin to what I am actually doing:

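function buildMessageRegex(string $message, ?string $delimiter, array $regex_fragments = []): string {
  // Escape regex special characters (and the delimiter, if one is given).
  $quoted_message = preg_quote($message, $delimiter);
  // Replace placeholders such as '%mail' with the provided regex fragments.
  $regex_body = strtr($quoted_message, $regex_fragments);
  // Without a delimiter, fall back to '(' and ')' as the delimiter pair.
  return $delimiter !== null
    ? $delimiter . '^' . $regex_body . '$' . $delimiter
    : '(^' . $regex_body . '$)';
}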

// By using $delimiter === null, we don't have to escape '/' or '@'.
$regex = buildMessageRegex('Mail: %mail, Dir: %dir.', null, [
  '%mail' => '\w+@\w+\.\w+',
  '%dir' => '\w+(/\w+)*',
]);
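
To see what the null-delimiter variant produces, here is a self-contained sketch with the helper reduced to the '()' case. The function name buildMessageRegexSketch() and the sample subject strings are made up for this illustration; the fragments are the ones used above.

```php
<?php
// Hypothetical, reduced variant: always wraps the pattern in '(' ... ')'.
function buildMessageRegexSketch(string $message, array $fragments): string {
    // Without a delimiter argument, '/' and '@' are left untouched.
    $quoted = preg_quote($message);
    // Swap the placeholders for the regex fragments.
    return '(^' . strtr($quoted, $fragments) . '$)';
}

$regex = buildMessageRegexSketch('Mail: %mail, Dir: %dir.', [
    '%mail' => '\w+@\w+\.\w+',
    '%dir'  => '\w+(/\w+)*',
]);

// Sample subjects, invented for the example.
$matched    = preg_match($regex, 'Mail: bob@example.com, Dir: usr/local/bin.');
$notMatched = preg_match($regex, 'Mail: bob, Dir: usr.');
```

With these fragments the balanced group in '%dir' does not disturb the outer '()' delimiters, because every '(' in the pattern has a matching ')'.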

Question

It seems that () is the only way to write a regex where I don't have to worry about the delimiter, and can call preg_quote($str, null) with null as the delimiter.

Is this assumption correct?

If so, I could always use () as the delimiter and would not need to provide a delimiter option in the methods.

Or am I missing something?

Scope

I am not sure if this problem/question is specific to PHP, or applies more generally to PCRE anywhere it is used (in Perl, I assume?).

I am personally interested in the PHP case, but I think it is worthwhile mentioning in a good answer how this applies outside of PHP.

Answer

Unfortunately, your assumption that ( and ) always need to be escaped does not hold. They don't normally need to be escaped inside a character class [...], but they do if () are the delimiters.

For example:

preg_match('/[(]/', "foo(bar", $match);

is valid, but

preg_match('([(])', "foo(bar", $match);

gets a "No ending matching delimiter ')' found" error.

So if you use () as delimiters, the caller will need to escape those characters inside [], which isn't normally required.
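
The failure is easy to reproduce, as is the extra escaping it forces on the caller. In the sketch below the variable names are invented for the example, and '@' suppresses the warning so that the return value can be inspected: preg_match() returns false when the delimiters cannot be parsed.

```php
<?php
$subject = 'foo(bar';

// '/' delimiters: an unescaped '(' inside a character class is fine.
$withSlash = preg_match('/[(]/', $subject);     // 1

// '()' delimiters: the '(' in the class is counted as a nested
// opening bracket, so no closing delimiter is found.
$withParens = @preg_match('([(])', $subject);   // false

// Escaping the '(' inside the class makes the pattern usable again.
$escaped = preg_match('([\(])', $subject);      // 1
```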
