Matching Deep Nested Elements HTML With RegEx

Time:12-16

I'm working on a number of HTML files and I'm trying to match a <p> tag inside a <li> inside a <ul>.

For example:

<ul>
   <li>1</li>
   <li><p>2</p></li>
   <li>
      <ul>
         <li><p>3</p></li>
      </ul>
   </li>
</ul>

My goal is to match both <p> tags (2 and 3) separately with their nearest parent <li> and <ul> tags.

Here's the regex I'm using:

/<ul>.*?(<li.*?>).*?(<p.*?>.*?<\/p>)(.*?)(<\/li>)/gs

The problem appears when I try to match against HTML like this:

<ul>
   <li>
      <ul>
         <li></li>
         <p>4</p>
      </ul>
   </li>
</ul>

It matches the <p> tag together with the outer <li> and <ul> tags further up the tree, even though that <li> is not the direct parent of the <p>.
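
The behaviour can be reproduced outside the app. The snippet below is only a sketch: it runs the pattern from the question against the second sample (the variable names are chosen here for illustration). A regex engine reports the leftmost match, so the match begins at the outer <ul>, and the lazy .*? is free to skip over the inner <ul> and <li></li> until it reaches the <p>.

```javascript
// The pattern from the question, unchanged.
const pattern = /<ul>.*?(<li.*?>).*?(<p.*?>.*?<\/p>)(.*?)(<\/li>)/gs;

// The second sample: <p> is a child of the inner <ul>, not of any <li>.
const html = [
  "<ul>",
  "   <li>",
  "      <ul>",
  "         <li></li>",
  "         <p>4</p>",
  "      </ul>",
  "   </li>",
  "</ul>",
].join("\n");

const match = pattern.exec(html);

console.log(match.index); // 0: the match starts at the outer <ul>
console.log(match[1]);    // "<li>": this is the outer <li>
console.log(match[2]);    // "<p>4</p>"
console.log(match[4]);    // "</li>": the outer closing tag
```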

Does anyone have an idea how I can fix this?

Edit: Assuming I need to use Regex for this matching. I might end up using selectors in JS anyway like you guys suggested, but I'd still like to know if there's an easy fix for this pattern since this logic already exists in my app using Regex.

CodePudding user response:

Is your goal to find or fix bad HTML (a <p> as a direct descendant of <ul> is not allowed), and is that why you are using regex? If so, a simple parser would likely be a better approach.
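
A "simple parser" here can be as small as a tag scanner with a stack of open elements. The following is a rough sketch, not a real HTML parser: findNestedParagraphs is a hypothetical name, and the code assumes no void elements such as <br>, no comments, and no ">" inside attribute values.

```javascript
// Hypothetical helper (not from the thread): returns every <p>...</p>
// whose parent is an <li> whose parent is a <ul>.
function findNestedParagraphs(html) {
  // Matches opening and closing tags; group 1 is "/" for closing tags,
  // group 2 is the tag name.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  const stack = [];   // currently open elements: { name, start }
  const results = [];

  for (const m of html.matchAll(tagPattern)) {
    const isClosing = m[1] === "/";
    const name = m[2].toLowerCase();

    if (!isClosing) {
      stack.push({ name, start: m.index });
      continue;
    }

    // Find the most recent open element with the same name.
    let idx = stack.length - 1;
    while (idx >= 0 && stack[idx].name !== name) idx--;
    if (idx < 0) continue; // stray closing tag, ignore it

    const open = stack[idx];
    stack.length = idx; // drop the element and anything left open inside it

    if (name === "p") {
      const parent = stack[stack.length - 1];
      const grandparent = stack[stack.length - 2];
      if (parent && parent.name === "li" &&
          grandparent && grandparent.name === "ul") {
        results.push(html.slice(open.start, m.index + m[0].length));
      }
    }
  }
  return results;
}

const validSample =
  "<ul>\n<li>1</li>\n<li><p>2</p></li>\n<li>\n<ul>\n<li><p>3</p></li>\n</ul>\n</li>\n</ul>";
const invalidSample =
  "<ul>\n<li>\n<ul>\n<li></li>\n<p>4</p>\n</ul>\n</li>\n</ul>";

console.log(findNestedParagraphs(validSample));   // [ '<p>2</p>', '<p>3</p>' ]
console.log(findNestedParagraphs(invalidSample)); // []
```

Because the stack always holds the chain of open ancestors, the "nearest parent" question is answered by looking at the top two entries instead of hoping a linear pattern lands on the right tags.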

If not, the simplest option would be something like document.createElement, then innerHTML, then querySelectorAll.
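
A minimal sketch of that DOM route, assuming the code runs in a browser (or any environment that provides a document object). selectNestedParagraphs is a hypothetical name, and the "ul > li > p" selector is one way to express the parent chain from the question.

```javascript
// Hypothetical helper; needs a browser-like `document`
// (not available in plain Node.js).
function selectNestedParagraphs(html) {
  const container = document.createElement("div");
  container.innerHTML = html;
  // "ul > li > p" only matches a <p> whose parent is an <li>
  // whose parent is a <ul>.
  return Array.from(container.querySelectorAll("ul > li > p"));
}

// Each result keeps its DOM links, so the nearest parents are
// one property away:
//
// for (const p of selectNestedParagraphs(html)) {
//   const li = p.parentElement;
//   const ul = li.parentElement;
// }
```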

If you do use RegExp, use negated < and > character classes as "delimiters" when matching tags, i.e.:

<foo[^>]*>

// and

[^<]*

Though obviously not fool-proof. Quick and dirty for your case:

/<ul>[^<]*<li[^>]*>[^<]*<p[^>]*>([^<]*)/
      |       |     |
      |       |      -- ...
      |        -- not >
       -- not <

It would break if there are tags inside the <p> (i.e. it relies on <p> ... </p> containing text only).
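
To see what the negated-class pattern does on the two samples from the question, here is a small sketch. The g flag, the texts helper, and the second pattern without the leading <ul> part are additions made here for illustration, not part of the answer. With the leading <ul> the pattern only reaches a <p> in the first <li> of a list, so it finds 3 but not 2; dropping that prefix finds both, at the cost of no longer checking for the <ul>.

```javascript
const valid = [
  "<ul>",
  "   <li>1</li>",
  "   <li><p>2</p></li>",
  "   <li>",
  "      <ul>",
  "         <li><p>3</p></li>",
  "      </ul>",
  "   </li>",
  "</ul>",
].join("\n");

const invalid = [
  "<ul>",
  "   <li>",
  "      <ul>",
  "         <li></li>",
  "         <p>4</p>",
  "      </ul>",
  "   </li>",
  "</ul>",
].join("\n");

// Pattern from the answer, with the g flag added to collect all matches.
const fromAnswer = /<ul>[^<]*<li[^>]*>[^<]*<p[^>]*>([^<]*)/g;
// Variant (an assumption, not from the answer): same idea without <ul>.
const withoutUl = /<li[^>]*>[^<]*<p[^>]*>([^<]*)/g;

// Collects the captured <p> text of every match.
const texts = (re, html) => Array.from(html.matchAll(re), (m) => m[1]);

console.log(texts(fromAnswer, valid));   // [ '3' ]
console.log(texts(withoutUl, valid));    // [ '2', '3' ]
console.log(texts(fromAnswer, invalid)); // []
console.log(texts(withoutUl, invalid));  // []
```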

CodePudding user response:

This is a partial answer.

The best I have managed so far is /<ul>.*?(<li.*?>(?:(?!<li>).)*?<p.*?>.*?<\/p>(?:(?!<\/li>).)*<\/li>)/gs

With

<ul>
   <li>1</li>
   <li><p>2</p></li>
   <li>
      <ul>
         <li><p>3</p></li>
      </ul>
   </li>
</ul>

it gives (the first one is obviously wrong):

<li>1</li> <li><p>2</p></li> and <li><p>3</p></li>

With

<ul>
   <li>
      <ul>
         <li></li>
         <p>4</p>
      </ul>
   </li>
</ul>

the result is

<li>
      <ul>
         <li></li>
         <p>4</p>
      </ul>
   </li>

Maybe someone can improve it further.
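
One possible tightening, offered as a sketch rather than a complete fix: match the opening tag with <li[^>]*> so it cannot run past its own ">", and let the lookahead stop at any opening or closing <li> or <ul> between the <li> and the <p>. The tightened pattern and the listItems helper are assumptions introduced here. The pattern still consumes the leading <ul>, so it reports at most one <li> per <ul>, and it still depends on the HTML being laid out as simply as in the samples.

```javascript
const tightened =
  /<ul>.*?(<li[^>]*>(?:(?!<\/?(?:li|ul)\b).)*?<p[^>]*>.*?<\/p>(?:(?!<\/li>).)*<\/li>)/gs;

const first = [
  "<ul>",
  "   <li>1</li>",
  "   <li><p>2</p></li>",
  "   <li>",
  "      <ul>",
  "         <li><p>3</p></li>",
  "      </ul>",
  "   </li>",
  "</ul>",
].join("\n");

const second = [
  "<ul>",
  "   <li>",
  "      <ul>",
  "         <li></li>",
  "         <p>4</p>",
  "      </ul>",
  "   </li>",
  "</ul>",
].join("\n");

// Collects capture group 1 (the <li>...</li> block) of every match.
const listItems = (html) => Array.from(html.matchAll(tightened), (m) => m[1]);

console.log(listItems(first));  // [ '<li><p>2</p></li>', '<li><p>3</p></li>' ]
console.log(listItems(second)); // []
```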

CodePudding user response:

You have been warned in the comments against using regular expressions with HTML.
They are correct: the hierarchical structure means a linear pattern cannot always find your desired solution.

Works with valid HTML

Assuming the HTML is valid anyway and there is only whitespace between the tags you are looking for, I have come up with this:

  \s*(<li.*>)?\s*(<p.*>.*<\/p>)\s*(<\/li>)?
  • This makes the surrounding li element optional but still captures it if it exists (at least in your examples).
  • Assuming whitespace everywhere else, so \s*
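  • I have replaced .*? with .*. Both mean "0 or more", but they are not interchangeable: .*? is lazy and .* is greedy, so .* can match too much when several elements of the same kind share a line.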
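
The difference between .* and .*? only shows up when more than one candidate sits on the same line. A minimal sketch with a made-up one-line input (not taken from the question):

```javascript
const line = "<p>a</p><p>b</p>";

// Greedy: .* grabs as much as possible, so both paragraphs end up in one match.
const greedy = line.match(/<p.*>.*<\/p>/)[0];

// Lazy: .*? grabs as little as possible, so the match stops at the first </p>.
const lazy = line.match(/<p.*?>.*?<\/p>/)[0];

console.log(greedy); // "<p>a</p><p>b</p>"
console.log(lazy);   // "<p>a</p>"
```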

You can experiment with it here:
https://regex101.com/r/oyNweY/1
