PHP Regex preg_replace function finds and replaces only the first and last of 3 matches, not the middle

I have the following JavaScript excerpt as text:

for (let orange of oranges) {

  for (let apple of apples) {

    for (let banana of bananas) {

      obfuscatedArray[i] = obfuscatedArray[i].split('').reverse().join('');
      obfuscatedArray[i] = window.atob(obfuscatedArray[i]);

    }

  }

}

from which I would like to remove the excess newlines at the bottom:

for (let orange of oranges) {

  for (let apple of apples) {

    for (let banana of bananas) {

      obfuscatedArray[i] = obfuscatedArray[i].split('').reverse().join('');
      obfuscatedArray[i] = window.atob(obfuscatedArray[i]);
    }
  }
}

I have written this regex:

`/(;|})(\n(\h*))+}/`

in the following PHP:

$myString = preg_replace('/(;|})(\n(\h*))+}/', "\$1\n\$3}", $myString);

but, for reasons I can't ascertain, the newline between the first closing curly brace and the second isn't being removed.

I have tested the regex in Regex101 (i.e. outside PHP's preg_replace() function) and it still finds only two matches instead of three.

I really can't understand where I'm going wrong with the regex.

CodePudding user response：

Your pattern consumes the ; or }, the newlines, and the closing } that follows them (i.e. it adds them to the overall match memory buffer and advances the regex index past them). Once a substring is consumed, the next match cannot consume the same text, so the } that ends one match cannot also serve as the leading } of the next.
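
The effect can be reproduced with a small sketch. It assumes the question's pattern reads `(\n(\h*))+}` with a `+` quantifier after the group, and `$tail` is a hypothetical, shortened stand-in for the end of the JavaScript excerpt (one statement followed by three closing braces, each preceded by a blank line).

```php
<?php
// Hypothetical stand-in for the end of the question's snippet.
$tail = "  atob(x);\n\n    }\n\n  }\n\n}";

// Pattern from the question, assuming a + quantifier after the group.
$pattern = '/(;|})(\n(\h*))+}/';

// The } that ends the first match is consumed, so the next match
// can only start at the following }.
$count = preg_match_all($pattern, $tail, $matches);

// Same replacement as in the question.
$result = preg_replace($pattern, "\$1\n\$3}", $tail);

echo $count, "\n";
echo $result, "\n";
```

Here `$count` is 2: the matches are `;` through the first `}` and the second `}` through the third, so the blank line between the first and second closing braces survives in `$result`.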

You may use lookarounds to override this:

preg_replace('~([;}])\h*\R(?=\h*(?:\R\h*)+})~', '$1', $text)
preg_replace('~(?<=[;}])\h*\R(?=\h*(?:\R\h*)+})~', '', $text)
preg_replace('~[;}]\K\h*\R(?=\h*(?:\R\h*)+})~', '', $text)

Note that the last two examples need no $1 backreference because the pattern has no capturing group: it was replaced either with a non-consuming lookbehind, (?<=[;}]), or with \K, which clears the current match memory buffer.

Details:

([;}]) - capturing group #1: a ; or } char
(?<=[;}]) - a positive lookbehind that requires ; or } to appear immediately to the left of the current location
[;}]\K - a ; or } and then the \K operator "loses" the text matched (the ; or } are removed from the match memory buffer)
\h* - zero or more horizontal whitespaces
\R - a line break sequence
(?=\h*(?:\R\h*)+}) - a positive lookahead that matches a location that is immediately followed by
- \h* - zero or more horizontal whitespaces
- (?:\R\h*)+ - one or more occurrences of a line break sequence followed by zero or more horizontal whitespaces
- } - a } char.
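
As a rough check, the three variants can be run side by side. This is only a sketch: it assumes the lookahead contains `(?:\R\h*)+` with a `+` quantifier, and `$tail` is a hypothetical, shortened stand-in for the end of the question's snippet.

```php
<?php
// Hypothetical stand-in for the end of the question's snippet.
$tail = "  atob(x);\n\n    }\n\n  }\n\n}";

// Each entry: [pattern, replacement]. The } stays inside the lookahead,
// so it is never consumed and can start the next match.
$variants = [
    'capture group' => ['~([;}])\h*\R(?=\h*(?:\R\h*)+})~', '$1'],
    'lookbehind'    => ['~(?<=[;}])\h*\R(?=\h*(?:\R\h*)+})~', ''],
    'match reset'   => ['~[;}]\K\h*\R(?=\h*(?:\R\h*)+})~', ''],
];

$results = [];
foreach ($variants as $name => [$regex, $replacement]) {
    $results[$name] = preg_replace($regex, $replacement, $tail);
}

// Print each result so the variants can be compared by eye.
foreach ($results as $name => $text) {
    echo "--- $name ---\n$text\n";
}
```

On this input all three variants give the same text, with each closing brace directly under the previous line.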

CodePudding user response：

Your pattern consumes the } on the last line of each match, so that } cannot take part in the next match attempt.

If you want to remove all "empty" lines in between, change your pattern so that it does not consume the }: instead, assert to the right a newline followed by optional horizontal whitespace chars and a }.

(;|})(\n(\h*))+(?=\n\h*})

In the replacement, use group 1: $1
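
A minimal sketch of this lookahead version in PHP, assuming the group keeps its `+` quantifier, i.e. `(\n(\h*))+(?=\n\h*})`. The `$tail` string is a hypothetical, shortened stand-in for the end of the question's snippet.

```php
<?php
// Hypothetical stand-in for the end of the question's snippet.
$tail = "  atob(x);\n\n    }\n\n  }\n\n}";

// The closing } is only asserted by the lookahead, not consumed,
// so it is still available to start the next match.
$pattern = '/(;|})(\n(\h*))+(?=\n\h*})/';

$count = preg_match_all($pattern, $tail);
$fixed = preg_replace($pattern, '$1', $tail);

echo $count, "\n";
echo $fixed, "\n";
```

With the } left unconsumed, `$count` is 3 rather than 2, and all three blank lines before the closing braces are removed.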

The pattern can also be written using \K instead of the first capture group, dropping the other superfluous capture groups, using a character class [;}] instead of an alternation, and using \R to match any Unicode newline sequence instead of only a line feed:

[;}]\K(?:\R\h*)*(?=\R\h*})

In the replacement use an empty string.

As you want to match all "empty" lines in between, you can replace (?:\R\h*)* with \s*, shortening the pattern to:

[;}]\K\s*(?=\R\h*})

The pattern matches:

[;}] Match either ; or }
\K Forget what is matched so far (clear the current match buffer)
\s* Match optional whitespace chars
(?=\R\h*}) Positive lookahead, assert from the current position a newline, optional horizontal whitespace chars and }
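
The two \K patterns can be tried out with a short sketch like the one below. `$tail` and `$multi` are hypothetical inputs: the first is a shortened stand-in for the end of the question's snippet, the second has two blank lines in a row before the brace.

```php
<?php
// Hypothetical stand-in for the end of the question's snippet.
$tail = "  atob(x);\n\n    }\n\n  }\n\n}";

// Longer form: repeated "line break + horizontal whitespace" groups.
$long = preg_replace('~[;}]\K(?:\R\h*)*(?=\R\h*})~', '', $tail);

// Shortened form: any run of whitespace, up to the last line break before }.
$short = preg_replace('~[;}]\K\s*(?=\R\h*})~', '', $tail);

// Hypothetical input with two consecutive blank lines before the brace.
$multi = preg_replace('~[;}]\K\s*(?=\R\h*})~', '', "x;\n\n\n}");

echo $long, "\n";
echo $short, "\n";
echo $multi, "\n";
```

Both forms give the same result on `$tail`, and on the second input the run of blank lines collapses to a single line break before the }.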