Find and exclude html-tags as whole words in negative lookbehind with regex

Time:02-03

I am trying to find all paragraphs in a text (in JavaScript/jQuery) that are not yet wrapped in one of a defined set of HTML tags:

p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe

My current regex (https://regex101.com/r/O4i2hP/1) already matches paragraphs and excludes the defined tags:

(.+?(?<![</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$) /gm

but I can't figure out how to match only whole tags.

The problem is:

[</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>] is a character class, so it matches a single character in the list </(p|h123456blockquteimgafr)> (case sensitive) rather than a whole tag.

Thus, as you can see from the example, text that is wrapped in other tags, such as <strong>TEXT</strong>, is also excluded.
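To make the difference concrete, here is a minimal sketch that checks only the end of a line, once with the character class and once with a group. The names `withCharClass`, `withGroup`, and `samples` are made up for illustration, and the patterns are simplified versions of the one in the question, not the exact regex.

```javascript
const tags = "p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe";

// Character class: every character inside [...] is a single-character
// alternative, so only the last character of the line is inspected.
const withCharClass = new RegExp(`(?<![</(${tags})>])$`);

// Group: (?:...) keeps the alternatives as whole tag names, so the
// lookbehind only rejects a complete closing tag such as </p> or </h2>.
const withGroup = new RegExp(`(?<!</(?:${tags})>)$`);

// Hypothetical sample lines; "true" means "treated as not yet wrapped".
const samples = [
  "<p>wrapped</p>",
  "<strong>TEXT</strong>",
  "Plain text.",
  "Ends with t",
];

const results = samples.map((line) => ({
  line,
  charClass: withCharClass.test(line),
  group: withGroup.test(line),
}));

console.table(results);
```

With the character class, any line whose last character happens to be in the list is rejected, which includes `</strong>` (it ends in `>`) and even a plain line ending in `t`. With the group, only lines ending in one of the listed closing tags are rejected.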

I tried different things, such as word boundaries (\bword\b), but couldn't get it working. I hope you can help. Thanks!

CodePudding user response:

This will do it.

^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe) ?>.*</\1>).*$
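A quick way to try this lookahead approach in JavaScript might look like the sketch below. It assumes the pattern is meant with `.*` quantifiers and the `g` and `m` flags; the sample `text` and the variable names are invented for the demonstration.

```javascript
// Negative lookahead at the start of each line: skip the line if it opens
// with one of the listed tags and contains the matching closing tag (\1).
const notWrapped =
  /^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe) ?>.*<\/\1>).*$/gm;

// Hypothetical input, one paragraph per line.
const text = [
  "<p>wrapped</p>",
  "<strong>TEXT</strong>",
  "Plain text.",
  "<h2>Title</h2>",
].join("\n");

// match() with the g flag returns every full match, i.e. every line that
// passed the lookahead.
const unwrapped = text.match(notWrapped);

console.log(unwrapped);
```

Two things to keep in mind with this variant: empty lines also match (as empty strings), and opening tags with attributes, such as `<p class="x">`, are not recognized as wrapped because the pattern expects `>` right after the tag name.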

CodePudding user response:

I have now found a working approach. The tag names need to be wrapped in groups rather than in a character class. The following works for me:

(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$) /gm

See also: https://regex101.com/r/DC5msM/1
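For completeness, here is a rough sketch of how the group-based lookbehind could be used to wrap the remaining paragraphs in `<p>` tags. It is a variant of the pattern above with non-capturing groups (so the replacement only needs `$1` and `$2`) and assumes a lazy `.+?` at the start; `input` and `wrapped` are illustrative names, and lookbehind requires a reasonably recent JavaScript engine.

```javascript
// Group 1: the line itself, rejected if it ends in a listed closing tag.
// Group 2: the line break (or the empty string at the end of the text).
const unwrappedLine =
  /(.+?(?<!<\/(?:p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>$))(\n|$)/gm;

// Hypothetical input, one paragraph per line.
const input = [
  "<h1>Title</h1>",
  "First paragraph.",
  "<strong>TEXT</strong>",
  "<p>done</p>",
].join("\n");

// Wrap every matched line in <p>...</p> and put the line break back.
const wrapped = input.replace(unwrappedLine, "<p>$1</p>$2");

console.log(wrapped);
```

Lines that already end in `</h1>` or `</p>` are left alone, while the plain line and the `<strong>` line get wrapped. Note that the check only looks at the end of the line, so trailing whitespace after a closing tag would make that line count as unwrapped.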
