PHP Regex preg_replace function finds and replaces only the first and last of 3 matches, not the mid

Time:12-23

I have the following javascript-excerpt-as-text:

for (let orange of oranges) {

  for (let apple of apples) {

    for (let banana of bananas) {

      obfuscatedArray[i] = obfuscatedArray[i].split('').reverse().join('');
      obfuscatedArray[i] = window.atob(obfuscatedArray[i]);

    }

  }

}

from which I would like to remove the excess newlines at the bottom:

for (let orange of oranges) {

  for (let apple of apples) {

    for (let banana of bananas) {

      obfuscatedArray[i] = obfuscatedArray[i].split('').reverse().join('');
      obfuscatedArray[i] = window.atob(obfuscatedArray[i]);
    }
  }
}

I have written this regex:

/(;|})(\n(\h*))+}/

in the following PHP:

$myString = preg_replace('/(;|})(\n(\h*))+}/', "\$1\n\$3}", $myString);

but, for reasons I can't ascertain, the newline between the first closing curly brace and the second isn't being removed.

I have tested the regex in Regex101 (i.e. outside PHP's preg_replace() function) and it still finds only two matches instead of three.

I really can't understand where I'm going wrong with the regex.

CodePudding user response:

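Your pattern consumes the ; or }, the line breaks after it, and the closing } that follows (consuming means the text is matched, added to the overall match memory buffer, and the regex index advances past it). Once a substring is consumed, the next match cannot consume the same text, so the } that ends one match cannot also be the leading } of the next match.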
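To see the consumption problem in isolation, here is a minimal sketch that counts the matches with preg_match_all(). The shortened input string and the variable names are made up for illustration, and the pattern assumes the question's regex has a + quantifier after the second group.

```php
<?php
// Hypothetical, shortened input: a statement followed by three closing
// braces, each preceded by an empty line.
$text = "x;\n\n    }\n\n  }\n\n}";

// Assumed form of the question's pattern: the trailing } is consumed.
$count = preg_match_all('/(;|})(\n(\h*))+}/', $text, $matches);

echo $count, PHP_EOL; // 2, not 3

// $matches[0] holds the full matches: ";\n\n    }" and "}\n\n}".
// The first } ends match one, so it cannot also start a match that would
// cover the line break before the second }.
var_dump($matches[0]);
```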

You may use lookarounds to override this:

preg_replace('~([;}])\h*\R(?=\h*(?:\R\h*)+})~', '$1', $text)
preg_replace('~(?<=[;}])\h*\R(?=\h*(?:\R\h*)+})~', '', $text)
preg_replace('~[;}]\K\h*\R(?=\h*(?:\R\h*)+})~', '', $text)

See the regex demo (or this regex demo).

Note that the last two examples need no $1 backreference because the pattern has no capturing group: the group was replaced either with a non-consuming lookbehind, (?<=[;}]), or with \K, which clears the current match memory buffer.

Details:

  • ([;}]) - capturing group #1: a ; or } char
  • (?<=[;}]) - a positive lookbehind that requires a ; or } to appear immediately to the left of the current location
  • [;}]\K - a ; or }, and then the \K operator "loses" the text matched (the ; or } is removed from the match memory buffer)
  • \h* - zero or more horizontal whitespaces
  • \R - a line break sequence
  • (?=\h*(?:\R\h*)+}) - a positive lookahead that matches a location that is immediately followed with
    • \h* - zero or more horizontal whitespaces
    • (?:\R\h*)+ - one or more occurrences of a line break sequence and zero or more horizontal whitespaces
    • } - a } char.
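As a rough check of the third variant above, the sketch below runs it on a shortened, made-up input (the $text and $result names are illustrative only). It assumes the lookahead contains a + quantifier, as the "one or more occurrences" description implies.

```php
<?php
// Hypothetical, shortened input with an empty line before each closing brace.
$text = "x;\n\n    }\n\n  }\n\n}";

// \K drops the ; or } from the reported match; the lookahead checks for the
// closing } without consuming it, so it stays available for the next match.
$result = preg_replace('~[;}]\K\h*\R(?=\h*(?:\R\h*)+})~', '', $text);

// $result is now "x;\n    }\n  }\n}": all three empty lines are gone.
echo $result, PHP_EOL;
```

Each match removes a single line break, so this variant fits input like the question's, where exactly one empty line precedes each closing brace.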

CodePudding user response:

Your pattern also matches the } on the last line of each match, so that } is already consumed and cannot take part in the next match attempt.

If you want to remove all "empty" lines in between, change the pattern so that it asserts, rather than consumes, a newline followed by optional horizontal whitespace chars and a } to the right.

(;|})(\n(\h*))+(?=\n\h*})

In the replacement, use group 1: $1

Regex demo


The pattern can also be written using \K, which makes the first capture group unnecessary. You can then drop the other superfluous capture groups, use a character class [;}] instead of an alternation, and use \R to match any Unicode newline sequence instead of only a newline:

[;}]\K(?:\R\h*)*(?=\R\h*})

In the replacement use an empty string.

Regex demo


As you want to match all "empty" lines in between, you can replace (?:\R\h*)* with \s*, shortening the pattern to:

[;}]\K\s*(?=\R\h*})

Regex demo

The pattern matches:

  • [;}] Match either ; or }
  • \K Forget what is matched so far (clear the current match buffer)
  • \s* Match optional whitespace chars
  • (?=\R\h*}) Positive lookahead, assert from the current position a newline, optional horizontal whitespace chars and }
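A quick way to try the shortened pattern is the sketch below. The input is made up and deliberately contains two consecutive empty lines before the first closing brace, to show that \s* removes a whole run of them; the variable names are illustrative only.

```php
<?php
// Hypothetical input: two empty lines before the first }, one before the others.
$text = "x;\n\n\n    }\n\n  }\n\n}";

// \s* backtracks until exactly one line break plus indentation is left in
// front of the }, which the lookahead asserts without consuming.
$result = preg_replace('~[;}]\K\s*(?=\R\h*})~', '', $text);

// $result is now "x;\n    }\n  }\n}".
echo $result, PHP_EOL;
```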