Regex: How to select everything but a specified regex pattern

Time:12-22

I am trying to create a regex that selects everything in a text except a specified pattern.

As you can see here: https://regex101.com/r/kFJFVi/2

The pattern of the text I want to ignore is <([^>]+?)([^>]*?)>(.*?)<\/\1>. I have tried several strategies, but without success so far.

Based on the question, I tried, for example, ^.*(<([^>]+?)([^>]*?)>(.*?)<\/\1>)?.*$, but this pattern selects all of the text and does not ignore the tags and their content.

    I also checked this question: but in this case

    The example text for this regex:

    
    This is the second paragraph. It contains an ordered list: 
    <ol>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ol>
            This is a text after the list in the second paragraph.
            This is another part of a paragraph
            <ol>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ol>
            This is a text after the other list in the second paragraph.
    This is a text after the list in the second paragraph.
            This is another part of a paragraph
            <ol>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ol>
    test to odfjdf iofsdfsoh
    
    

    The expected result is:

    1st match

    This is the second paragraph. It contains an ordered list: 
    

    2nd match

     This is a text after the list in the second paragraph.
            This is another part of a paragraph
    

    3rd match

    This is a text after the other list in the second paragraph.
    This is a text after the list in the second paragraph.
            This is another part of a paragraph
    

    4th match:

    test to odfjdf iofsdfsoh
    

    Basically, all text that is not inside an HTML element.

    CodePudding user response:

    To create a regular expression that selects everything in a text except a specified pattern, you can use a negative lookahead assertion. A negative lookahead specifies a pattern that must not match at the current position; the rest of the expression is tried only where that pattern does not match.

    For example, to match all text that is not contained within the HTML tags specified in your question, you can use the following regular expression:

    (?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*
    

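    Here (?!...) asserts that the specified HTML tag pattern does not start at the current position, and .* then matches any character except a line break, zero or more times. Note that the lookahead is checked only at the position where a match begins, so an element that starts later on the same line is still consumed by .*.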
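To see how the lookahead behaves in practice, here is a small check. It assumes the quantifier after the first [^>] is +? (the space shown in the pattern looks like a lost +), and the two sample strings are made up for illustration.

```javascript
// The answer's pattern, with the negative lookahead only at the very start.
const lookaheadFirst = /(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*/g;

// Case 1: the element starts later in the line, so the lookahead passes
// at index 0 and .* then consumes the whole line, element included.
const mixed = "before <b>x</b> after".match(lookaheadFirst);
console.log(mixed[0]); // "before <b>x</b> after"

// Case 2: the line starts with an element. The lookahead rejects index 0,
// the engine retries at index 1, and .* matches from there.
const tagged = "<li>Item 1</li>".match(lookaheadFirst);
console.log(tagged[0]); // "li>Item 1</li>"
```

In both cases the markup ends up inside the match, which suggests the lookahead on its own is not enough to skip the elements.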

    JS example:

    const input = "..."; // the input text
    const regex = /(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*/g; // the regular expression
    const matches = input.match(regex); // array of matches (may include empty strings)
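If the goal is the text between the elements, one alternative is to match the elements themselves and keep what lies between the matches. The following is only a sketch under some assumptions: the helper name textOutsideElements and the sample string are hypothetical, the element pattern is a simplified variant of the one in the question, and elements of the same name are assumed not to be nested inside each other.

```javascript
// Hypothetical helper: returns the trimmed text segments found between
// matched elements (an opening tag through its matching closing tag).
function textOutsideElements(input) {
  // Group 1 is the tag name; [\s\S]*? lazily spans line breaks up to
  // the first closing tag with the same name.
  const element = /<([^\s>]+)[^>]*>[\s\S]*?<\/\1>/g;
  const segments = [];
  let last = 0;
  for (const m of input.matchAll(element)) {
    segments.push(input.slice(last, m.index)); // text before this element
    last = m.index + m[0].length;              // continue after the element
  }
  segments.push(input.slice(last));            // text after the last element
  return segments.map(s => s.trim()).filter(s => s !== "");
}

// Shortened sample based on the text in the question.
const sample = [
  "This is the second paragraph. It contains an ordered list:",
  "<ol>",
  "    <li>Item 1</li>",
  "    <li>Item 2</li>",
  "</ol>",
  "This is a text after the list.",
  "<ol><li>Item 3</li></ol>",
  "test to odfjdf iofsdfsoh",
].join("\n");

const segments = textOutsideElements(sample);
console.log(segments); // three segments: the text before, between, and after the lists
```

Unpaired or self-closing tags such as <br> are not matched by this pattern and would remain in the output, so a parser is likely the safer choice for arbitrary HTML.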
    

    CodePudding user response:

    If RegExp is not an absolute requirement:

    It is often easier to parse XML/HTML with DOMParser than with a regular expression. The code below parses the content into a new document, removes the <ol> elements, and cleans up the result.

    // Parse the markup of #content into a separate document.
    const parser = new DOMParser();
    const doc = parser.parseFromString(document.getElementById("content").innerHTML, "text/html");
    // Remove every <ol> element, wherever it is nested.
    doc.querySelectorAll("ol").forEach(n => n.remove());
    // Split the remaining text into lines, trim them, and drop empty ones.
    const result = doc.body.textContent.split("\n").map(str => str.trim()).filter(str => str !== "");
    console.log(result);
    <div id="content">
    This is the second paragraph. It contains an ordered list: 
    <ol>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ol>
            This is a text after the list in the second paragraph.
            This is another part of a paragraph
            <ol>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ol>
            This is a text after the other list in the second paragraph.
    This is a text after the list in the second paragraph.
            This is another part of a paragraph
            <ol>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ol>
    test to odfjdf iofsdfsoh
    </div>
