Regex: How to select everything but a specified regex pattern

I am trying to create a regex that selects everything in a text except a specified pattern.

As you can see here: https://regex101.com/r/kFJFVi/2

The pattern I want to ignore is <([^>]+?)([^>]*?)>(.*?)<\/\1>. I have tried several strategies, without success so far.

For example, based on another question, I tried ^.*(<([^>]+?)([^>]*?)>(.*?)<\/\1>)?.*$, but this pattern selects all of the text and does not ignore the tags and their content.

I also checked this question: but in this case

The sample text I am using with this regex:


This is the second paragraph. It contains an ordered list: 
<ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ol>
        This is a text after the list in the second paragraph.
        This is another part of a paragraph
        <ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ol>
        This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
        This is another part of a paragraph
        <ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ol>
test to odfjdf iofsdfsoh

The expected result is:

1st match

This is the second paragraph. It contains an ordered list:

2nd match

 This is a text after the list in the second paragraph.
        This is another part of a paragraph

3rd match

This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
        This is another part of a paragraph

4th match

test to odfjdf iofsdfsoh

Basically, all text that is not inside an HTML element.

CodePudding user response：

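To create a regular expression that selects everything in a text except a specified pattern, you can start from a negative lookahead assertion. A negative lookahead specifies a pattern that must not match at the current position; the overall match can only begin at a position where that pattern does not start.

For example, as a starting point for matching text that is not contained within the HTML tags specified in your question, you can try the following regular expression:

(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*

This regular expression matches any character except a line break (.) zero or more times (*), starting only at a position where the specified HTML tag pattern does not begin ((?!...)). Note that the lookahead is checked only at the start of each match attempt, so a tag that appears later on the same line is still consumed by .*, and because . does not cross line breaks, a multi-line <ol>...</ol> block is never rejected as a whole. On the sample text from the question, this pattern alone does not yield the four expected matches.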
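To see where a negative lookahead is actually tested, here is a small sketch on a hypothetical one-line string. The sample text and the simplified /(?!<b>).*/ pattern are assumptions for illustration and do not come from the question.

```javascript
// Hypothetical one-line sample, not taken from the question.
const sample = "<b>x</b> tail";

// The lookahead rejects a match starting at index 0 (a "<b>" begins there),
// so the engine retries at index 1, where ".*" consumes the rest of the line,
// including the remainder of the tag.
const first = sample.match(/(?!<b>).*/);

console.log(first.index); // 1
console.log(first[0]);    // "b>x</b> tail"
```

The lookahead only moves the start of the match by one character. It does not remove the tag from what .* consumes afterwards.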

JS example:

const input = "..."; // the input text
const regex = /(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*/g; // the lookahead pattern shown above
const matches = input.match(regex); // array of all matches, or null if there are none
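A lookahead is only tested where a match attempt starts, so one alternative is to match the tag blocks themselves and keep the text between them. The following is a minimal sketch of that idea, not the method of the answer above. The helper name textOutsideTags, the simplified tagBlock pattern (which uses [\s\S] so a block can span several lines), and the shortened sample are all assumptions made for illustration.

```javascript
// Simplified tag-block pattern (an assumption, not the question's exact regex):
// an opening tag, then anything up to the first closing tag with the same name.
// [\s\S] is used instead of . so the block may span several lines.
const tagBlock = /<([^\s>\/]+)[^>]*>[\s\S]*?<\/\1>/g;

// Hypothetical helper: returns the trimmed text pieces found between tag blocks.
function textOutsideTags(input) {
  const pieces = [];
  let last = 0;
  for (const m of input.matchAll(tagBlock)) {
    pieces.push(input.slice(last, m.index)); // text before this block
    last = m.index + m[0].length;            // resume after the block
  }
  pieces.push(input.slice(last));            // text after the final block
  return pieces.map(s => s.trim()).filter(s => s !== "");
}

// Shortened version of the sample text from the question
const sample = [
  "This is the second paragraph. It contains an ordered list:",
  "<ol>",
  "    <li>Item 1</li>",
  "</ol>",
  "This is a text after the list in the second paragraph.",
  "<ol>",
  "    <li>Item 1</li>",
  "</ol>",
  "test to odfjdf iofsdfsoh",
].join("\n");

console.log(textOutsideTags(sample));
```

On this shortened sample the helper returns three pieces, one for each run of text outside an <ol> block. The sketch does not handle an element nested inside another element with the same tag name, and unclosed tags such as <br> are left in the text; for arbitrary HTML a real parser is the safer choice.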

CodePudding user response：

If RegExp is not an absolute requirement:

It is often easier to parse XML/HTML with DOMParser than with RegExp. The code below parses the content into a new document, removes the <ol> elements, and cleans up the result.

const p = new DOMParser();
const doc = p.parseFromString(document.getElementById("content").innerHTML, "text/html");
// Remove every <ol> element, wherever it sits inside the parsed body
doc.querySelectorAll("body ol").forEach(n => n.remove());
// Split the remaining text into lines, trim each one, and drop the empty ones
const result = doc.body.textContent.split("\n").map(str => str.trim()).filter(str => str !== "");
console.log(result);

<div id="content">
This is the second paragraph. It contains an ordered list: 
<ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ol>
        This is a text after the list in the second paragraph.
        This is another part of a paragraph
        <ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ol>
        This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
        This is another part of a paragraph
        <ol>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ol>
test to odfjdf iofsdfsoh
</div>