How to exclude a word from regex

Time:12-29

I have a regex that works. However, I want it to drop matches that contain a specific word.

/\<meta[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is

This matches the following (http-equiv and url in either order):

  1. <meta http-equiv="refresh" content="21;URL='http://example.com/'" />
  2. <meta content="21;URL='http://example.com/'" http-equiv="refresh" />

I want to exclude any URL that contains ?PageSpeed=noscript, such as:

  a. <meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
  b. <meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />

Any ideas are much appreciated. Thanks.

CodePudding user response:

You could use a DOM parser instead of a regex.

<?php

$meta = '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" /> <meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />';

$dom = new DOMDocument;
$dom->loadHTML($meta);
$noPageScripts = [];

foreach ($dom->getElementsByTagName('meta') as $tag) {
  $content = $tag->getAttribute('content');
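  // Capture the URL part of the content attribute, with or without quotes
  preg_match('/URL=["\']?([^"\'>]+)["\']?/i', $content, $matches);

  // Keep only refresh tags whose URL does not contain ?PageSpeed=noscript
  if (strcasecmp($tag->getAttribute('http-equiv'), 'refresh') === 0 && isset($matches[1]) && stripos($matches[1], '?PageSpeed=noscript') === false) {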
    $noPageScripts[] = [
      'originalTag' => $dom->saveHTML($tag),
      'url' => $matches[1]
    ];
  }
}

var_dump($noPageScripts);


CodePudding user response:

I rewrote the whole pattern, partly for better performance, so it differs a bit from yours. The basic idea is to add a negative lookahead that rejects the disallowed text at a point where most of the matching is already done. For example, I put it right after http: http(?!\S*?pagespeed=noscript)

The \S*? lazily matches any number of non-whitespace characters. See the SO regex FAQ.
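To see the lookahead on its own, here is a minimal sketch that filters a list of bare URLs. The URLs are taken from the sample tags in the question. The ^ and $ anchors and the trailing \S* are assumptions added for this standalone test and are not part of the pattern in this answer.

```php
<?php
// Sample URLs lifted from the meta tags in the question.
$urls = [
    "http://example.com/",
    "http://example.com/?PageSpeed=noscript",
    "http://example.com/segment?PageSpeed=noscript&var=value",
];

// "http" is accepted only if no run of non-whitespace characters after it
// leads to "pagespeed=noscript". The i flag makes the check case-insensitive.
// The ^...$ anchors are an addition for this isolated test.
$pattern = '/^http(?!\S*?pagespeed=noscript)\S*$/i';

// preg_grep keeps the entries that match; array_values reindexes from 0.
$allowed = array_values(preg_grep($pattern, $urls));

print_r($allowed);
```

Only the first URL should survive, because the other two contain PageSpeed=noscript somewhere after http.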

And the full pattern I experimented with:

/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*?url=[\s\'\"]*(http(?!\S*?pagespeed=noscript)[^><\s\"\']*)[^><]*>/i

Another addition is a positive lookahead for matching http-equiv... so that the pattern is independent of attribute order. It is similar to a regex pattern I posted a long time ago in the comments on PHP.net.
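As a rough check, the pattern can be run against the four sample tags from the question with preg_match_all. This sketch assumes the tags sit on separate lines, as they do here. The variable names are only for illustration.

```php
<?php
// The four sample tags from the question: the first two should match,
// the last two carry ?PageSpeed=noscript and should be skipped.
$html = <<<'HTML'
<meta http-equiv="refresh" content="21;URL='http://example.com/'" />
<meta content="21;URL='http://example.com/'" http-equiv="refresh" />
<meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
<meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />
HTML;

// Positive lookahead: the tag must contain http-equiv ... refresh somewhere.
// Negative lookahead: the URL must not lead to pagespeed=noscript.
$pattern = '/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*?url=[\s\'\"]*(http(?!\S*?pagespeed=noscript)[^><\s\"\']*)[^><]*>/i';

preg_match_all($pattern, $html, $matches);

// Group 1 holds the captured URL of every tag that matched.
$urls = $matches[1];
print_r($urls);
```

Capture group 1 should hold http://example.com/ twice, once for each attribute order, and nothing for the two PageSpeed tags.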
