Regex to find a specific string that is not in an HTML attribute

My case is: I have a string with HTML elements:

<a href="something specific_string" title="testing">This is a text and "specific_string"</a>

I need a regex that matches only the occurrence that is not inside an HTML attribute.

This is my current regex. It works, but it gives a false positive when the string is wrapped in double quotes:

((?!\"[\w\s]*)specific_string(?![\w\s]*\"))

CodePudding user response：

If you want to get what's inside the tag, you could use split() to cut the string at every ">" and "<", basically like this:

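const html = "<a href='something specific_string' title='testing'>This is a text and 'specific_string'</a>";

// Drop everything up to the first '>' (the opening tag), then cut at the next '<'.
const afterOpeningTag = html.split('>')[1];
const parts = afterOpeningTag.split('<');

// parts[0] holds the text between the opening and closing tags.
console.log(parts)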

So, when you want to manipulate it, just use element 0 of the resulting array. It's not a regex like you wanted, but it's an idea.
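
The same idea can be extended to strings with more than one tag by splitting on whole tags instead of single characters. The following is only a rough sketch: textSegments is a hypothetical helper name, and it assumes that no attribute value contains a literal ">".

```javascript
// Hypothetical helper: return the text that sits between tags.
// Assumes no attribute value contains a literal '>' character.
function textSegments(html) {
  // Splitting on /<[^>]*>/ removes every tag and leaves the text around them.
  return html.split(/<[^>]*>/).filter(segment => segment.length > 0);
}

const sample = "<a href='something specific_string' title='testing'>This is a text and 'specific_string'</a>";
const segments = textSegments(sample);

// Only text content is searched, so the occurrence in href is ignored.
console.log(segments.filter(segment => segment.includes('specific_string')));
```

Because the tags are discarded before searching, the positions found this way refer to the text segments, not to the original string.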

CodePudding user response：

Though a regex can suffice in simple cases, you should know that RegExp is often said to be ill-suited for parsing HTML, and depending on your environment you could be better off using more robust techniques. (There's http://htmlparsing.com/ dedicated to the topic, but it doesn't yet discuss JS.)

That said, the following works in Chrome 107 and Node 16.13.

(s=>s.match(/(?<=>[^<]*|^[^<]*)specific_string/))
('<a href="something specific_string" title="testing">This is a text and "specific_string"</a>')

It uses a look-behind assertion. If that is not available, you could use /(>[^<]*|^[^<]*)(specific_string)/ instead and add the length of the first capture group to the match index to get the position of the match.
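
The capture-group variant might look roughly like this. It is a sketch only: findInText is a hypothetical helper name, the needle is escaped so it is matched literally, and the pattern still assumes that no attribute value contains a ">".

```javascript
// Hypothetical helper: locate `needle` in text content without look-behind.
function findInText(html, needle) {
  // Escape regex metacharacters so the needle is matched literally.
  const escaped = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp('(>[^<]*|^[^<]*)(' + escaped + ')');
  const m = re.exec(html);
  if (m === null) return null;
  // m.index points at the start of the prefix group, so skip past that group.
  const start = m.index + m[1].length;
  return { start, end: start + m[2].length };
}

const html = '<a href="something specific_string" title="testing">This is a text and "specific_string"</a>';
const pos = findInText(html, 'specific_string');

// The reported range covers the occurrence in the text, not the one in href.
console.log(pos, html.slice(pos.start, pos.end));
```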

Since you mentioned in a comment that you'll be replacing text in user-provided HTML, I encourage you to consider the security implications (namely XSS).
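
One common precaution, sketched here under the assumption that the replacement text itself may come from an untrusted source, is to escape it before inserting it into markup. escapeHtml is a hypothetical helper, not part of any library, and escaping the replacement alone does not make arbitrary user-provided HTML safe.

```javascript
// Hypothetical helper: escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Assumed scenario: the replacement string is supplied by a user.
const untrusted = '<img src=x onerror="alert(1)">';
console.log(escapeHtml(untrusted));
```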

Back on the topic of parsing HTML without RegExp: in a web browser the tools are already there, and I couldn't stop myself from writing a quick and dirty text node replacer in browser JS, working in Chrome 107:

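((html, fun) => {
  // Parse the HTML into a detached element so nothing is rendered.
  const el = document.createElement('body')
  el.innerHTML = html
  // Select every element that has at least one text node child.
  const X = new XPathEvaluator, R = X.evaluate('//*[text()]', el)
  const A = []; for (let n; n = R.iterateNext();) A.push(n) // mutating el while iterating XPathResult is illegal
  for (let n of A) fun(n)
  return el.innerHTML})
('<a href="something specific_string" title="testing">This is a text and "specific_string"</a>',
  // Caveat: n.innerHTML also contains the attributes of any nested child elements,
  // and without the g flag only the first occurrence is replaced.
  n => n.innerHTML = n.innerHTML
    .replace(/specific_string/, '<b>replaced</b>'))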