regex - how to search for everything else other than my pattern with lookbehinds/lookaheads

I've been going over lookaheads and lookbehinds, read some posts here, and watched some videos.

I understand some of the simpler examples, but I can't find an example that does what I want.

What I want to do is grab all content except the things within <>.

EXAMPLE

"12345 Lorem ipsum dolor sit <12w34t5>amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ve<45w6w7q8>niam, quis nostru<98d74fds5>d exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis a<61598>ute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa<98fds45fcx3> qui offic<7vc8952>ia deserunt mollit<598e312> anim id est laborum."

Wanted Result:

12345, Lorem, ipsum, dolor, sit, amet, consectetur, etc...

I know how to grab all content within <> with something like (?<=<)[\d\w]+(?=>): https://regex101.com/r/wo22oK/1

How do I do the opposite and grab everything except the items within <>? I've tried something like (?<!<)[\d\w]+(?!>), and it almost works, but it only ignores the first and last characters within the angle brackets.

https://regex101.com/r/12G5JC/1

Any help would be appreciated, thanks.

CodePudding user response：

Well one approach might be to just strip off all tags:

var input = "12345 Lorem ipsum dolor sit <12w34t5>amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ve<45w6w7q8>niam, quis nostru<98d74fds5>d exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis a<61598>ute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa<98fds45fcx3> qui offic<7vc8952>ia deserunt mollit<598e312> anim id est laborum.";
var output = input.replace(/<.*?>/g, "");
console.log(output);
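
If the goal is the comma-separated word list from the question rather than the cleaned sentence, one possible follow-up is to pull the word characters out of the stripped string. This is only a sketch: the shortened `sample` string and the choice of `\w+` as the definition of a "word" are assumptions, not something the question specifies.

```javascript
// Shortened sample in the same shape as the question's input (assumed for brevity).
const sample = "12345 Lorem sit <12w34t5>amet, ve<45w6w7q8>niam, quis.";

// Step 1: remove everything between angle brackets, as in the snippet above.
const stripped = sample.replace(/<.*?>/g, "");

// Step 2: collect runs of word characters; match() returns null when nothing matches.
const wordList = stripped.match(/\w+/g) ?? [];

console.log(wordList.join(", "));
// 12345, Lorem, sit, amet, veniam, quis
```

Note that "ve" and "niam" come back as the single word "veniam" because the tag is removed before the words are split.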

CodePudding user response：

One solution is to do the opposite: look for the text between the beginning and the first <, then between each > and the next <, and finally from the last > to the end. That gives 8 matches, and the second capturing group of each contains what we want.

(^|\>)([^<>]+)(<|$)

https://regex101.com/r/htRWCv/1
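
In JavaScript, the second capturing group could be collected with matchAll(). This is a rough sketch on a shortened, assumed `sample` string; the variable names are illustrative only.

```javascript
// Shortened sample in the same shape as the question's input (assumed for brevity).
const sample = "12345 Lorem <12w34t5>amet, ve<45w6w7q8>niam, mollit<598e312> anim.";

// Group 1: start of string or a closing ">".
// Group 2: the text outside the brackets (what we want).
// Group 3: the next opening "<" or the end of the string.
const outside = /(^|>)([^<>]+)(<|$)/g;

// Keep only group 2 from every match.
const pieces = Array.from(sample.matchAll(outside), (m) => m[2]);

console.log(pieces);
// [ '12345 Lorem ', 'amet, ve', 'niam, mollit', ' anim.' ]
console.log(pieces.join(""));
// 12345 Lorem amet, veniam, mollit anim.
```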

Another solution is to replace what we don't want, <[^<>]+>, with an empty string: https://regex101.com/r/XtYmJW/1

"12345 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

CodePudding user response：

You can do it with the JavaScript replaceAll() function. Here's a quick snippet to test it:

// "<", one or more non-bracket characters, ">"; the g flag is required by replaceAll()
const regex = /<[^<>]+>/g;
function testRegex() {
    let text = document.getElementById('text').value;
    document.getElementById('final').innerHTML = text.replaceAll(regex, "");
}

<body>
<label for="text">Text : </label>
<input type="text" name="text" id="text">
<button onclick="testRegex()">Regex</button>
<p id="final"></p>
</body>

CodePudding user response：

If you want to do it in a single pass you can match both what you want and don't want, only capturing the former.

Using /<[^<>]*>|(\b\S+\b)/g naïvely will not work, however, as it would capture "ve<45w6w7q8>niam" as a word.

You can use /<[^<>]*>|((?:(?!<[^<>]*>)\S)+)/g, but it will match "ve" and "niam" as distinct words, which may not be desirable.

<script type=module>
import {notAhead, capture, either, flags, suffix} from "https://unpkg.com/[email protected]/compose-regexp.js?module"

const text = "12345 Lorem sit <12w34t5>amet, consectetur  ve<45w6w7q8>niam, quis nostru<98d74fds5>d exercitation culpa<98fds45fcx3> qui.";

const Tag = /<[^<>]*>/;
const Word = suffix("+", notAhead(Tag), /\S/);
const regexp = flags.add("g", either(Tag, capture(Word)));

console.log(regexp);
// const regexp = /<[^<>]*>|((?:(?!<[^<>]*>)\S)+)/g;

for (const [match, capture] of text.matchAll(regexp)) {
    console.log({capture, match});
}    
</script>

You can play with the solution in this Flems.io sandbox.
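
For anyone who would rather avoid the library, the same single-pass idea can be sketched with the plain regex literal from above. The shortened `sample` string and the variable names are assumptions for illustration.

```javascript
// Shortened sample in the same shape as the question's input (assumed for brevity).
const sample = "Lorem sit <12w34t5>amet, ve<45w6w7q8>niam, culpa<98fds45fcx3> qui.";

// First alternative consumes a whole tag without capturing it.
// Second alternative captures a run of non-space characters that stops before any tag.
const tagOrWord = /<[^<>]*>|((?:(?!<[^<>]*>)\S)+)/g;

const words = [];
for (const m of sample.matchAll(tagOrWord)) {
    // m[1] is undefined when the match was a tag, so only real words are kept.
    if (m[1] !== undefined) words.push(m[1]);
}

console.log(words);
// [ 'Lorem', 'sit', 'amet,', 've', 'niam,', 'culpa', 'qui.' ]
```

As noted above, "ve" and "niam," come out as separate entries because the tag splits the word.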

CodePudding user response：

Using replace may be the simplest way to do it but here is an example of how to "grab all content except the things within <>" using a negative lookahead.

It assumes that angle brackets are only ever used as tag delimiters.

[^<>]+ matches one or more characters that are not angle brackets, and the negative lookahead prevents a match when > appears before < ahead in the string.

const text = "12345 Lorem ipsum dolor sit <12w34t5>amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ve<45w6w7q8>niam, quis nostru<98d74fds5>d exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis a<61598>ute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa<98fds45fcx3> qui offic<7vc8952>ia deserunt mollit<598e312> anim id est laborum.";

const match = text.match(/[^<>]+(?![^<]*>)/g);

console.log(match.join(''));

Alternatively, a positive lookahead can be used to only match if < or the end of string $ appears before > ahead in the string.

/[^<>]+(?=[^>]*(?:<|$))/g
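
A short usage sketch for the positive-lookahead variant, again assuming that angle brackets only ever appear as tag delimiters. The shortened `sample` string and the variable names are assumptions for illustration.

```javascript
// Shortened sample in the same shape as the question's input (assumed for brevity).
const sample = "Duis a<61598>ute irure, culpa<98fds45fcx3> qui offic<7vc8952>ia.";

// Match a run of non-bracket characters only if the next bracket ahead is "<",
// or if there are no more brackets before the end of the string.
const outsideTags = /[^<>]+(?=[^>]*(?:<|$))/g;

// match() returns null when nothing matches, so fall back to an empty array.
const kept = sample.match(outsideTags) ?? [];

console.log(kept.join(""));
// Duis aute irure, culpa qui officia.
```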