Excluding inner tags from string using Regex

Time:08-08

I have the following text:

If there would be more <div>matches <div>in</div> string</div>, you will merge them to one

How do I make a JS regex that will produce the following text?

If there would be more <div>matches in string</div>, you will merge them to one

As you can see, the additional <div> tag has been removed.

CodePudding user response:

I would use DOMParser's parseFromString to parse the string into an HTMLDocument, which is a much easier interface to work with for this problem. You are not going to solve it well with regex.
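To make the regex limitation concrete, here is a minimal sketch. The patterns and variable names are illustrative assumptions, not something from the question. A lazy match pairs the outer <div> with the inner </div>, and a greedy match runs across sibling divs:

```javascript
const input = "more <div>matches <div>in</div> string</div>, merged";

// Lazy quantifier: stops at the FIRST </div>, which belongs to the inner div
const lazy = input.match(/<div>(.*?)<\/div>/)[1];

// Greedy quantifier: runs to the LAST </div>, swallowing unrelated siblings
const siblings = "<div>a</div> and <div>b</div>";
const greedy = siblings.match(/<div>(.*)<\/div>/)[1];

console.log(lazy);   // matches <div>in
console.log(greedy); // a</div> and <div>b
```

Neither quantifier can track how deeply the tags are nested. Parsing the string into a DOM, as in the snippet that follows, avoids the pairing problem entirely.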

// parseFromString requires a MIME type as its second argument
const htmlDocument = new DOMParser().parseFromString("this <div>has <div>nested</div> divs</div>", "text/html");

htmlDocument.body.childNodes; // NodeList(2): [ #text, div ]

From there, the algorithm depends on exactly what you want to do. Solving the problem exactly as you described to us isn't too tricky: recursively walk the DOM tree; remember whether you've seen a tag yet; if so, exclude the node and merge its children into the parent's children.

In code:

const simpleExampleHtml = `<div>Hello, this is <p>a paragraph</p> and <div>some <div><div><div>very deeply</div></div> nested</div> divs</div> that should be eliminated</div>`

// Parse into an HTML document and take its <body>
const doc = new DOMParser().parseFromString(simpleExampleHtml, "text/html").body;

// Process a node, removing any tags that have already been seen
const processNode = (node, seenTags = []) => {
  // If this is a text node, return it
  if (node.nodeName === "#text") {
    return node.cloneNode()
  }
  // If this node has been seen, return its children
  if (seenTags.includes(node.tagName)) {
    // flatMap flattens, in case the same node is repeatedly nested
    // note that this is a newer JS feature and lacks IE11 support: https://caniuse.com/?search=flatMap
    return Array.from(node.childNodes).flatMap(child => processNode(child, seenTags))
  }
  // If this node has not been seen, process its children and return it
  const newChildren = Array.from(node.childNodes).flatMap(child => processNode(child, [...seenTags, node.tagName]))
  // Clone the node so we don't mutate the original
  const newNode = node.cloneNode()
  // We can't directly assign to node.childNodes - append every child instead
  newChildren.forEach(child => newNode.appendChild(child))
  return newNode
}

// resultBody is an HTML <body> Node with the desired result as its childNodes
const resultBody = processNode(doc);
const resultText = resultBody.innerHTML
// <div>Hello, this is <p>a paragraph</p> and some very deeply nested divs that should be eliminated</div> 

But make sure you know EXACTLY what you want to do!

There are many potential complications with data that's more complex than your example. Here are some cases where the simple approach may not give you the desired result.

<!-- nodes where nested identical children are meaningful -->
<ul>
  <li>Nested list below</li>
  <li>
    <ul>
      <li>Nested list item</li>
    </ul>
  </li>
</ul>

<!-- nested nodes with classes or IDs -->

<span>A span with <span>nested spans <span id="DeeplyNested"></span></span></span>

<div class="form-container">
  <form>
    <div>
      <label for="username">Username</label>
      <input type="text" name="username" />
    </div>
    <div>
      <label for="password">Password</label>
      <input type="text" name="password" />
    </div>
  </form>
</div>
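The first complication can be reproduced without a browser. The sketch below is an illustration under assumptions, not part of the approach above: it models elements as plain { tag, children } objects (text nodes are strings), repeats the same seen-tags walk as processNode, and adds a hypothetical keepNested set of tag names that are allowed to nest inside themselves.

```javascript
// Hypothetical minimal tree: a text node is a string, an element is { tag, children }
const el = (tag, ...children) => ({ tag, children });

// Same walk as processNode, plus an assumed allowlist of tags that may self-nest
const flatten = (node, seen = [], keepNested = new Set()) => {
  if (typeof node === "string") return [node];
  const processChildren = nextSeen =>
    node.children.flatMap(child => flatten(child, nextSeen, keepNested));
  // Repeated tag that is not on the allowlist: drop it, keep its children
  if (seen.includes(node.tag) && !keepNested.has(node.tag)) {
    return processChildren(seen);
  }
  // Otherwise keep the element and record its tag for the descendants
  return [el(node.tag, ...processChildren([...seen, node.tag]))];
};

// Serialize the tree back to an HTML-like string
const render = node =>
  typeof node === "string"
    ? node
    : `<${node.tag}>${node.children.map(render).join("")}</${node.tag}>`;

const list = el("ul",
  el("li", "Nested list below"),
  el("li", el("ul", el("li", "Nested list item"))));

const naive = flatten(list).map(render).join("");
const guarded = flatten(list, [], new Set(["ul", "li"])).map(render).join("");

console.log(naive);   // <ul><li>Nested list below</li><li>Nested list item</li></ul>
console.log(guarded); // <ul><li>Nested list below</li><li><ul><li>Nested list item</li></ul></li></ul>
```

With an empty keepNested the nested list collapses into its parent; listing ul and li keeps it intact. Whether an allowlist like this is the right rule depends on your data.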

CodePudding user response:

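A simple approach without regex: put the string into a p element via innerHTML, read the first div's content as innerText (which drops any HTML tags inside it), assign that text back to the div, and finally read the result from the p's innerHTML. Note that this removes every tag inside the first div, not just nested div tags: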

let text = 'If there would be more <div>matches <div>in</div> string</div>, you will merge them to one';
const p = document.createElement('p');
p.innerHTML = text;
p.querySelector('div').innerText = p.querySelector('div').innerText;
console.log(p.innerHTML);
