Regex to find a specific string that is not in a HTML attribute

Time:11-10

My case is: I have a string with HTML elements:

<a href="something specific_string" title="testing">This is a text and "specific_string"</a>

I need a Regex to match only the one that is not in a HTML attribute.

This is my current regex. It works, but it gives a false positive when the string is wrapped in double quotes:

((?!\"[\w\s]*)specific_string(?![\w\s]*\"))

CodePudding user response:

If you want to get what's inside the tag, you could use split() to cut the string at every ">" and "<", basically like this:

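const html = "<a href='something specific_string' title='testing'>This is a text and 'specific_string'</a>";

// Keep what follows the first ">", then cut that at every "<".
const afterOpeningTag = html.split('>')[1];
const parts = afterOpeningTag.split('<');

console.log(parts) // parts[0] is the text inside the tag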

So, when you want to manipulate the text, just use the element at index 0 of the result. It's not a regex like you wanted, but it's an idea.
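
The same idea can be stretched to a whole string with several tags. This is only a rough sketch, not something from the answer above: replaceInText is a made-up name, and it assumes no attribute value contains a ">" character. Splitting on a regex with a capturing group keeps the tags in the result, so the tags land at odd indices and the text between them at even indices.

```javascript
// Hypothetical helper: replace every literal `needle` that sits in text
// content, leaving anything inside <...> untouched.
// Assumption: no attribute value contains ">".
function replaceInText(html, needle, replacement) {
  return html
    .split(/(<[^>]*>)/) // capturing group => the tags are kept, at odd indices
    .map((part, i) =>
      i % 2 === 1
        ? part // a tag: leave as-is
        : part.split(needle).join(replacement)) // text: literal replace-all
    .join('');
}

const sample = '<a href="something specific_string" title="testing">This is a text and "specific_string"</a>';
console.log(replaceInText(sample, 'specific_string', 'REPLACED'));
```

Using split/join for the inner replacement avoids having to escape regex metacharacters in the needle.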

CodePudding user response:

Though it can suffice in simple cases, you should know it's often said that RegExp is ill-suited to parsing HTML, and depending on your environment you could be better off using more robust techniques. (http://htmlparsing.com/ is dedicated to the topic, although it doesn't discuss JS yet.)

That said, the following works in Chrome 107 and Node 16.13.

(s=>s.match(/(?<=>[^<]*|^[^<]*)specific_string/))
('<a href="something specific_string" title="testing">This is a text and "specific_string"</a>')

It uses look-behind. Where look-behind isn't available, you could use /(>[^<]*|^[^<]*)(specific_string)/ instead and add the length of the first capture group to the match index to get the position of the match.
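
To make the index compensation concrete, here is a minimal sketch. The function name indexOutsideTags is hypothetical, and I made the quantifiers lazy ([^<]*?), which differs from the pattern above: with the greedy version, the match lands on the last occurrence within a run of text rather than the first.

```javascript
// Hypothetical helper: position of the first `specific_string` that sits in
// text content, without relying on look-behind.
function indexOutsideTags(html) {
  // Group 1 is the text leading up to the match, starting either at a ">"
  // or at the start of the input; group 2 is the string itself.
  const m = /(>[^<]*?|^[^<]*?)(specific_string)/.exec(html);
  if (m === null) return -1;
  // m.index points at the start of group 1, so skip over group 1.
  return m.index + m[1].length;
}

const input = '<a href="something specific_string" title="testing">This is a text and "specific_string"</a>';
const at = indexOutsideTags(input);
console.log(at, input.slice(at, at + 'specific_string'.length));
```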


As you answered in a comment that you'll be replacing in user-provided HTML, I encourage you to consider the security implications (namely XSS).
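
As one small piece of that, if the replacement text itself can come from a user, it should at least be escaped before it is spliced into markup. The sketch below uses a hypothetical escapeHtml helper of my own; it only covers the replacement text and does not sanitize the surrounding user-provided HTML, which would need a proper sanitizer.

```javascript
// Hypothetical helper: escape the characters that are significant in HTML
// text and in quoted attribute values.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, ch => entities[ch]);
}

// A replacement that would otherwise inject markup becomes inert text.
console.log(escapeHtml('<img src=x onerror="alert(1)">'));
```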

Back on the topic of parsing HTML without RegExp: in a web browser we obviously have the techniques at hand, and I couldn't stop myself from writing a quick-and-dirty text-node replacer in web JS, working in Chrome 107:

((html, fun) => {
  const el = document.createElement('body')
  el.innerHTML = html
  const X = new XPathEvaluator, R = X.evaluate('//*[text()]', el)
  const A = []; for (let n; n = R.iterateNext();) A.push(n) // mutating el while iterating XPathResult is illegal
  for (let n of A) fun(n)
  return el.innerHTML})
('<a href="something specific_string" title="testing">This is a text and "specific_string"</a>',
  n => n.innerHTML = n.innerHTML
    .replace(/specific_string/, '<b>replaced</b>'))