Can regex find last occurrence of </div> before <custom-tag>?

I have a full page HTML and need to find all HTML code between

<w-block-content><span><div>

and

</div></span></w-block-content>

Please note that

the elements might have attributes
the HTML might be formatted or not, so there might be extra (empty) lines
I do not have control of the code between the above tags
so there might be any number of </div> elements inside. Only the last one is the correct boundary for the regex output

I have a working JavaScript regex match

(?:<w-block-content.* data-block-content-id=\"96e80afb-afa0-4e46-bfb7-34b80da76112\"[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

that captures what I need, but only as long as there is no </div> inside the content. So I need a regex that ignores every </div> except the last one before </span></w-block-content>

The regex I used for testing is here: https://regex101.com/r/jekZhr/2

I know that regex is not the best tool for treating XML / HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.

UPDATE

I was asked to simplify the regex match. The current structure of the HTML is

a <w-block-content> element, and inside it
a <span> element, and inside it
a <div> element

These are "just" wrappers for the content I need to get as a group out of the regex.

I simplified the partially working regex to https://regex101.com/r/jekZhr/3

(?:<w-block-content.*[\s\S]*?<div>)
  ([\s\S]*?)
(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

But I do not think it can be any simpler. I am not a regex guru, so I am asking here. My thoughts were

I need to find the w-block-content element
then I need to find the div element
then I am interested in everything between these and
the closing of div, span and w-block-content

I am using .*[\s\S]*? in case someone adds an extra space or line break in the HTML.
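To make the problem concrete, here is a minimal sketch that runs the simplified pattern against a small sample. The sample markup and the names sampleHtml, lazyPattern and lazyCapture are assumptions made up for illustration; only the pattern itself comes from the question.

```javascript
// Hypothetical sample markup; the nested <div>s stand in for content
// that the author of the regex does not control.
const sampleHtml = [
  '<w-block-content data-block-content-id="x">',
  '  <span>',
  '    <div>',
  '      <div>inner one</div>',
  '      <div>inner two</div>',
  '    </div>',
  '  </span>',
  '</w-block-content>',
].join("\n");

// The simplified pattern from the question, written as a single literal.
const lazyPattern = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Group 1 is lazy, so it ends at the first </div> after which
// </span> and </w-block-content> can still be found.
const lazyCapture = sampleHtml.match(lazyPattern)[1];

console.log(JSON.stringify(lazyCapture));
```

In this sample the captured group ends before the first inner </div>, so the second inner <div> is lost, which is the behaviour described above.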

CodePudding user response：

As already commented, regex isn't a general purpose tool -- in fact it's a specific tool that matches patterns in a string. Having said that here's a regex solution that will match everything after the first <div> up to </w-block-content>. From there find the last index of </div> and .slice() it.

RegExp

/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g


Explanation

A look behind: (?<=...) must precede the match, but will not be included in the match itself.

A look ahead: (?=...) must follow the match, but will not be included in the match itself.

Segment	Description
(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)	Asserts that the literal "<w-block-content", then anything, then the literal "<div", then anything, then the literal ">" comes before whatever is matched. Not included in the match.
[\s\S]*?	Matches any characters, including line breaks, as few as possible.
(?=<\/w-block-content>)	Asserts that the literal "</w-block-content>" comes after whatever is matched. Not included in the match.

Example

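// Everything after the first <div ...> inside <w-block-content>, up to </w-block-content>
const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;

// Markup to search (the <main> element shown below)
const str = document.querySelector("main").innerHTML;

// First match; this assumes at least one match, since match() returns null otherwise
const A = str.match(rgx)[0];

// Position of the last </div> inside the match
const idx = A.lastIndexOf("</div>");

// Keep only what precedes that last </div>
const X = A.slice(0, idx);

console.log(X);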

<main>
  <w-block-content id="A">
    CONTENT OF #A
    <span id="B">
      CONTENT OF #B
      <div id="C">
        <div>CONTENT OF #C</div>
        <div>CONTENT OF #C</div>
      </div>
      CONTENT OF #B
    </span>
    CONTENT OF #A
  </w-block-content>
</main>
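
The same two steps (match, then cut at the last </div>) can also be sketched as a small function that runs outside the browser and tolerates input without a match. The helper name extractBeforeLastDiv and the sample string are hypothetical, not part of the answer above; the pattern is the one from the answer, without the g flag.

```javascript
// Pattern from the answer above: everything after the first <div ...>
// inside <w-block-content>, up to the closing </w-block-content>.
const blockPattern = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/;

// Hypothetical helper: returns the content before the last </div>,
// or null when the pattern or a closing </div> is not found.
function extractBeforeLastDiv(html) {
  const match = html.match(blockPattern);
  if (match === null) return null;
  const lastClose = match[0].lastIndexOf("</div>");
  return lastClose === -1 ? null : match[0].slice(0, lastClose);
}

// Hypothetical one-line sample with two nested <div>s.
const page =
  '<w-block-content id="A"><span id="B"><div id="C">' +
  '<div>one</div><div>two</div>' +
  '</div></span></w-block-content>';

console.log(extractBeforeLastDiv(page));
console.log(extractBeforeLastDiv("<p>no block here</p>"));
```

Returning null for both failure cases is just one possible design choice; throwing an error would work equally well if a missing block should be treated as a bug.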

CodePudding user response：

Here's the regex that worked for me, when applied to the example you provided; I've broken it out to three separate lines for visual clarity, and presumably you'd combine them back into one line or something:

(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)
[\s\S]*?
(?=<\/div>\s*<\/span>\s*<\/w-block-content>)

I don't think you need to use capture groups () in this case. If you're using a look-behind (?<=) and a look-ahead (?=) for your boundaries-finding (both of which are non-capturing), then you can just let the entire match be the content that you want to find.

I added this answer because I didn't see the other answers using [^>] (= negated character class) to allow the tag strings to be open-ended in accepting additional attributes without entirely skipping any enforcement of tag closure, which I think is a cleaner and safer approach.
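
As a rough check of the pattern above, here is a sketch that joins the three lines into one literal and runs it against a small sample with attributes on every wrapper. The sample markup and the names strictPattern, wrappedHtml and innerContent are assumptions for illustration only.

```javascript
// The three-line pattern from this answer, combined into a single literal.
const strictPattern = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>)[\s\S]*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/;

// Hypothetical sample: attributes on the wrappers, nested <div>s inside.
const wrappedHtml = [
  '<w-block-content data-block-content-id="x">',
  '  <span class="s">',
  '    <div class="d">',
  '      <div>one</div>',
  '      <div>two</div>',
  '    </div>',
  '  </span>',
  '</w-block-content>',
].join("\n");

// The whole match is the wanted content; no capture group is needed.
const found = wrappedHtml.match(strictPattern);
const innerContent = found === null ? null : found[0].trim();

console.log(innerContent);
```

Because the look-ahead only accepts a </div> that is followed by </span> and </w-block-content> (with optional whitespace in between), the inner closing tags are skipped and both nested <div>s stay in the match.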

I'm admittedly not a JavaScript guy here, so a note on the [\s\S] construct: it is the classic JavaScript work-around for . not matching line breaks. Since ES2018, JavaScript also has a single-line (dotAll) mode via the s flag, so in an engine that supports the look-behind used here you can normally write . together with /s instead.
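
For reference, the s (dotAll) flag has been part of JavaScript since ES2018, the same edition that introduced look-behind. A variant of the pattern using . can be sketched as follows; the sample string and variable names are assumptions for illustration.

```javascript
// Same boundaries as the pattern above, but "." with the s (dotAll) flag
// replaces [\s\S], so "." also matches line breaks.
const dotAllPattern = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>).*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/s;

// The identical pattern without the flag, for comparison.
const noFlagPattern = /(?<=<w-block-content[^>]*>\s*<span[^>]*>\s*<div[^>]*>).*?(?=<\/div>\s*<\/span>\s*<\/w-block-content>)/;

// Hypothetical sample whose content spans several lines.
const multiLineHtml =
  "<w-block-content><span><div>\nline one\n<div>nested</div>\n</div></span></w-block-content>";

const withFlag = multiLineHtml.match(dotAllPattern);
const withoutFlag = multiLineHtml.match(noFlagPattern);

console.log(JSON.stringify(withFlag && withFlag[0]));
console.log(withoutFlag); // no match: "." stops at the first line break
```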