I have a full page HTML and need to find all HTML code between
<w-block-content><span><div>
and
</div></span></w-block-content>
Please note that
- the elements might have properties
- the HTML might be formatted or not - there might be extra (empty) lines
- I do not have control of the code between above tags
- so there might be any number of </div> elements inside, but only the last one is the real boundary that determines the correct output of the regex
I have working javascript regex match
(?:<w-block-content.* data-block-content-id=\"96e80afb-afa0-4e46-bfb7-34b80da76112\"[\s\S]*?<div>)
([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)
that captures what I need, but only when there is no </div> inside the content. So I need a regex that ignores every </div> except the last one before </span></w-block-content>
regex I used for testing is here https://regex101.com/r/jekZhr/2
I know that regex is not the best tool for processing XML / HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.
UPDATE
I was asked to simplify the regex match. The current structure of the HTML is
- <w-block-content element. And inside this one is
- <span element. And inside this one is
- <div element.
These are "just" wrappers for the content I need to get as a group out of the regex.
I simplified the partially working regex to https://regex101.com/r/jekZhr/3
(?:<w-block-content.*[\s\S]*?<div>)
([\s\S]*?)
(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)
But I do not think it can be any simpler. I am not regex guru so I am asking here. My thoughts were
- I need to find w-block-content element
- then I need to find div element
- then I am interested in everything in between these and
- ending of div span w-block-content
I am using .*[\s\S]*? in case someone adds an extra space or line break in the HTML.
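To make the problem concrete, here is a minimal reproduction. The sample HTML and the variable names are made up for illustration; only the pattern is the simplified one from above, joined into a single line. It shows the lazy group stopping at the first inner </div> rather than at the wrapper's closing tag.

```javascript
// Hypothetical sample input: the wrapped content itself contains a </div>.
const html = `<w-block-content data-block-content-id="x">
<span>
<div>
<div>inner</div>
<p>tail</p>
</div>
</span>
</w-block-content>`;

// The simplified pattern from the question, written as one regex literal.
const pattern = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Group 1 ends at the first </div>, so "<p>tail</p>" is lost.
const captured = html.match(pattern)[1];
console.log(JSON.stringify(captured));
console.log(captured.includes("<p>tail</p>")); // false
```

The cause is the `[\s\S]*?` between `<\/div>` and `<\/span>` in the trailing group: it lets any `</div>` act as the boundary as long as a `</span>` appears somewhere later.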
CodePudding user response:
As already commented, regex isn't a general-purpose tool; it's a specific tool that matches patterns in a string. Having said that, here's a regex solution that matches everything after the first <div> up to </w-block-content>. From there, find the last index of </div> and .slice() it.
RegExp
/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g
Explanation
A look-behind: (?<=...) must precede the match, but will not be included in the match itself.
A look-ahead: (?=...) must follow the match, but will not be included in the match itself.
Segment | Description |
---|---|
(?<=<w-block-content[\s\S]*?<div[\s\S]*?>) | Require that literal "<w-block-content", then anything, then literal "<div", then anything, then literal ">" comes before whatever is matched. Do not include it in the match. |
[\s\S]*? | Match anything, as few characters as possible. |
(?=<\/w-block-content>) | Require that literal "</w-block-content>" comes after whatever is matched. Do not include it in the match. |
Example
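// Match everything after the first <div ...> up to </w-block-content>
const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;
const str = document.querySelector("main").innerHTML;
const matches = str.match(rgx) || []; // match() returns null when nothing matches
const A = matches[0] || "";
// Cut at the last </div>, which is the one closing the wrapper div
const idx = A.lastIndexOf("</div>");
const X = idx === -1 ? A : A.slice(0, idx);
console.log(X);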
<main>
<w-block-content id="A">
CONTENT OF #A
<span id="B">
CONTENT OF #B
<div id="C">
<div>CONTENT OF #C</div>
<div>CONTENT OF #C</div>
</div>
CONTENT OF #B
</span>
CONTENT OF #A
</w-block-content>
</main>
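The same two-step idea (regex match, then cut at the last </div>) can also be sketched as a function over a plain string, so it runs outside the browser as well. The name extractInner and the sample string are assumptions made for this illustration; they are not part of the answer above.

```javascript
// extractInner is a hypothetical helper name used only for this sketch.
function extractInner(html) {
  // Everything after the first <div ...> up to the first </w-block-content>.
  const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/;
  const match = html.match(rgx);
  if (match === null) return null; // no wrapper found

  // The last </div> in the match is the one that closes the wrapper div.
  const idx = match[0].lastIndexOf("</div>");
  return idx === -1 ? null : match[0].slice(0, idx);
}

// Made-up sample: the wrapper div contains two inner divs.
const sample =
  '<w-block-content id="A"><span id="B"><div id="C">' +
  "<div>one</div><div>two</div>" +
  "</div></span></w-block-content>";

console.log(extractInner(sample)); // <div>one</div><div>two</div>
```

Returning null for both "no wrapper" and "no closing </div>" is a design choice of this sketch; callers that need to tell the two cases apart would have to return something richer.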
CodePudding user response:
Here's the regex that worked for me, when applied to the example you provided; I've broken it out to three separate lines for visual clarity, and presumably you'd combine them back into one line or something:
(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)
[\s\S]*?
(?=<\/div>\s*<\/span>\s*<\/w-block-content>)
I don't think you need to use capture groups () in this case. If you're using a look-behind (?<=) and a look-ahead (?=) for your boundary-finding (both of which are non-capturing), then you can just let the entire match be the content that you want to find.
I added this answer because I didn't see the other answers using [^>] (a negated character class). It lets each tag accept additional attributes while still requiring that the tag is actually closed, which I think is a cleaner and safer approach.
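As a rough check of the pattern above, here it is joined into one regex literal and run against a made-up page fragment. The sample HTML, the attribute values, and the variable names are assumptions for illustration only.

```javascript
// The three lines from above combined into a single regex literal.
const rgx = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)[\s\S]*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/;

// Hypothetical input: wrappers carry attributes, content has inner divs.
const page = `<w-block-content data-block-content-id="x">
  <span class="wrap">
    <div class="outer">
      <div>first</div>
      <div>second</div>
    </div>
  </span>
</w-block-content>`;

const found = page.match(rgx);
// The whole match is the content; no capture group is needed.
const content = found === null ? null : found[0];
console.log(content);
```

Because the look-ahead only accepts a </div> that is followed (apart from whitespace) by </span> and </w-block-content>, the inner </div> tags are skipped. One caveat of [^>]*: an attribute value that itself contains a > character would end the tag match early.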
I'm admittedly not a JavaScript guy, so one caveat: JavaScript only gained single-line (dotAll) mode, the /s flag, in ES2018, which is also the edition that added look-behind. On engines without /s you have to use [\s\S] as a work-around instead of just . (a plain dot). Where the flag is available, the dot version should work just as well.
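For engines that implement the ES2018 s (dotAll) flag, the two spellings can be compared side by side. This is only a sketch; the input string and the variable names are invented for the comparison.

```javascript
// Same boundaries, two ways of saying "any character including newlines".
const withDotAll = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>).*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/s;
const withClass = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)[\s\S]*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/;

// Hypothetical input containing line breaks and an inner div.
const input =
  "<w-block-content><span><div>\nline 1\n<div>x</div>\n</div></span></w-block-content>";

const a = input.match(withDotAll)[0];
const b = input.match(withClass)[0];
console.log(a === b); // true
console.log(JSON.stringify(a));
```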