Regex matching with exclusions of findings

Time:12-09

How can I find all strings between < and > while excluding certain tags such as b, i, ul, ol, li and p? Is there a shorter solution than the following?

while ($html =~ /<(\w+)>/g) {
  print "found $1\n" if $1 ne 'b' && $1 ne 'ul' && $1 ne 'p' ...
}

Thank you for any hint.

CodePudding user response:

You can use

while ($html =~ /<(?!(?:b|ul|p)>)(\w+)>/g) {
  print "found $1\n";
}

Details:

  • < - a < char
  • (?!(?:b|ul|p)>) - a negative lookahead that fails the match if, immediately to the right of the current location, there is b, ul or p followed by a > char
  • (\w+) - Capturing group 1: one or more word chars
  • > - a > char.
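
To cover the full list from the question (b, i, ul, ol, li, p) without typing the alternation by hand, the lookahead can be built from a list. The following is only a sketch: the helper name other_tags and the sample HTML are made up for illustration, and like the regex above it only sees opening tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of all attribute-less opening tags
# in $html whose name is not in the exclusion list @skip.
sub other_tags {
    my ($html, @skip) = @_;

    # Build the alternation b|i|ul|... from the list; quotemeta guards
    # against regex metacharacters in the names.
    my $skip = join '|', map { quotemeta($_) } @skip;

    my @found;
    while ($html =~ /<(?!(?:$skip)>)(\w+)>/g) {
        push @found, $1;
    }
    return @found;
}

# Sample input, assumed for illustration only.
my $html = '<div><p>Hello <b>bold</b> <big>text</big></p><ul><li>item</li></ul></div>';

my @tags = other_tags($html, qw(b i ul ol li p));
print "found $_\n" for @tags;
```

For this sample it prints "found div" and "found big". The > inside the lookahead is what keeps big from being rejected just because it starts with b.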

CodePudding user response:

You can also use a library; Mojo::DOM makes this easy:

use feature 'say';
use Mojo::DOM;

my $dom = Mojo::DOM->new($html);

# Each matching element stringifies to its HTML; use $_->tag for just the name
for ( $dom->find(':not(b,i,ul,ol,li,p)')->each ) {
    say;
}

The HTML is now parsed as well, so you can process it further as needed.
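
If only the tag names are wanted, as in the question, a variation along these lines should work. It is a sketch that assumes Mojolicious is installed; the sample HTML is made up. Unlike the regex approach, the parser also handles tags that carry attributes.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample input, assumed for illustration only.
my $html = '<div><p>Hello <b>bold</b> <span class="x">text</span></p><ul><li>item</li></ul></div>';
my $dom  = Mojo::DOM->new($html);

# Select every element that is not one of the excluded tags,
# then call ->tag on each one to get its name.
my @tags = $dom->find(':not(b,i,ul,ol,li,p)')->map('tag')->each;

print "found $_\n" for @tags;
```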
