Regex - match text not inside HTML a-tag

Time:11-14

I'm trying to create a regex that matches "Wonder woman" as long as it is not inside an a-tag.

The regex I have so far:

(Wonder woman)?<a.*?<\/a>|(\{\S+?\})

This matches the a-tag from beginning to end (including both). I think I'm close, but I'm out of ideas.

In the following string, I want to match the phrase "wonder woman" (case insensitive) as long as it is not inside an a-tag. It's the last two lines (separated by a newline) that I'm trying to create a regex for.

Don't match me. I'm just random text <a>Wonder woman</a>
<a>Wonder
woman</a>
<a>test</a>
<a> Wonder
Woman test test 
</a>
This is some random text that  should not be matched. 
Wonder
Woman

I also tried the following regex, but it doesn't match "wonder woman" if it's on two separate lines:

wonder woman(?![^<a>]*?<\/a>)

Any help with my regex is much appreciated.

Note: I'm not interested in replacing everything else with an empty string.

I want to match the specific word(s) and then insert a different word, let's say "Captain America".

CodePudding user response:

You can use a DOMParser to parse the HTML, then loop through the text nodes and replace all occurrences of 'wonder woman' with 'captain america':

const str = `Don't match me. I'm just random text <a>Wonder woman</a>
<a>Wonder
woman</a>
<a>test</a>
<a> Wonder
Woman test test 
</a>
This is some random text that  should not be matched. 
Wonder
Woman
`

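function match(s){
  // Parse the string as HTML so that <a> contents become child elements, not text
  const parsed = new DOMParser().parseFromString(s, 'text/html')
  parsed.body.childNodes.forEach(e => {
    // Only top-level text nodes are touched; text inside <a> elements is left alone.
    // $1 keeps the original whitespace (space or newline) between the two words.
    if(e.nodeType === Node.TEXT_NODE) e.data = e.data.replace(/wonder([\r\n ]*)woman/gi, 'captain$1america')
  })
  return parsed.body.innerHTML
}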

console.log(match(str))
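
The loop above only visits the direct children of body, so text nested inside other elements (a span, for example) is not replaced. A minimal sketch of a recursive variant follows. The helper name replaceOutsideAnchors is hypothetical, and the function is written against the generic node shape (nodeType, nodeName, childNodes, data) rather than anything specific to this answer. It does not handle a phrase that is split across several text nodes.

```javascript
// Sketch only: replaceOutsideAnchors is a hypothetical helper name.
const TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in browsers

function replaceOutsideAnchors(node, pattern, replacement) {
  // Copy childNodes into an array so the loop works on NodeLists and plain arrays
  for (const child of Array.from(node.childNodes || [])) {
    if (child.nodeType === TEXT_NODE) {
      // Text node outside any <a>: apply the replacement
      child.data = child.data.replace(pattern, replacement);
    } else if (String(child.nodeName).toUpperCase() !== 'A') {
      // Descend into every element except <a>, whose subtree is skipped entirely
      replaceOutsideAnchors(child, pattern, replacement);
    }
  }
}

// Browser usage (assumes DOMParser is available):
// const doc = new DOMParser().parseFromString(html, 'text/html');
// replaceOutsideAnchors(doc.body, /wonder(\s+)woman/gi, 'captain$1america');
// console.log(doc.body.innerHTML);
```

Skipping the whole subtree of an a element also covers nested markup such as a span inside a link.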

CodePudding user response:

/(?<!<a(\s|>)(.(?!<\/a>))*)wonder\s+woman/gis

This matches "wonder woman" only when it is not preceded by an opening <a tag that has no closing </a> tag between it and the match.

const str = `Don't match me. I'm just random text <a>Wonder woman</a>
<a> yu Wonder
woman</a>
<a>test</a>
<a> Wonder
Woman test test
</a>
This is some random text that  should not be matched.
Wonder
Woman

<a href="">
<span> value Wonder
Woman</span>
</a>
text . Wonder woman
<span></span> <a>..</a>`;

const result = str.replace(/(?<!<a(\s|>)(.(?!<\/a>))*)wonder\s+woman/gis, '*****');

console.log(result);

If there are nested <a> tags, such as

<a><a>..</a>wonder woman</a>

the input is difficult to handle with a regex alone.
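
For the nested case, one possible approach is to let a regex find the tokens and keep the nesting depth in ordinary code. The sketch below is only an illustration: replaceAtDepthZero is a hypothetical name, and it assumes the markup is simple enough that a-tags can be recognised by <a ...> and </a> alone (no comments, scripts or attribute values containing ">").

```javascript
// Sketch only: replaceAtDepthZero is a hypothetical helper name.
function replaceAtDepthZero(html, replacement) {
  let depth = 0; // number of <a> elements currently open
  // Tokens of interest: an opening <a ...>, a closing </a>, or the phrase itself
  const tokens = /<a\b[^>]*>|<\/a\s*>|wonder\s+woman/gi;
  return html.replace(tokens, (token) => {
    if (token.startsWith('</')) {
      depth = Math.max(0, depth - 1); // closing tag: leave one level
      return token;
    }
    if (token.startsWith('<')) {
      depth += 1; // opening tag: enter one level
      return token;
    }
    // The phrase: replace it only when no <a> is open
    return depth === 0 ? replacement : token;
  });
}

console.log(replaceAtDepthZero('<a><a>..</a>wonder woman</a> wonder woman', '*****'));
// prints "<a><a>..</a>wonder woman</a> *****"
```

The counter is what a single regex cannot express: the phrase inside the outer link is kept because one <a> is still open when it is reached.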

CodePudding user response:

Using lookaheads and lookbehinds can do the trick:

/(?<!<a>.*)wonder\swoman(?!.*<\/a>)/gi

Be aware that lookbehinds aren't supported in older versions of Safari (support was added in Safari 16.4).
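
Where lookbehind is not available, a possible alternative is to match whole <a>...</a> elements first and capture the phrase only in the second branch of an alternation, then decide in a replacer callback. This is a sketch under assumptions: replaceOutsideLinks is a hypothetical name, and the pattern assumes a-tags are not nested.

```javascript
// Sketch only: replaceOutsideLinks is a hypothetical helper name.
function replaceOutsideLinks(html, replacement) {
  // First branch consumes a complete <a>...</a> element (including newlines);
  // second branch captures the phrase when it appears anywhere else.
  const pattern = /<a\b[^>]*>[\s\S]*?<\/a>|(wonder\s+woman)/gi;
  // If the capture group is set, the phrase was found outside a link
  return html.replace(pattern, (whole, phrase) => (phrase ? replacement : whole));
}

const sample = 'text <a>Wonder woman</a>\n<a> Wonder\nWoman test </a>\nWonder\nWoman';
console.log(replaceOutsideLinks(sample, 'Captain America'));
```

Because the link branch is tried first at every position, any "wonder woman" inside a link is swallowed together with its tag and returned unchanged.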
