Extract string between HTML tags with RegEx where the open one has attribute in it

Time:01-14

This is about Python's re module. Given the example below, I would like to extract everything between the main tags.

<main attr="value">
<foo>bar</foo>
</main>

The expected output

<foo>bar</foo>

My problem while building the regex pattern is the attribute in the opening tag:

<main attr="value">
     ^^^^^^^^^^^^^

I'm not sure how to express this with regex. Without the attribute (<main>), this regex does work:

(?s)<main>(.*?)<\/main>

I assume it is something with .* but I didn't get it. How can I ignore the string between <main and > in the first line?

This question is not about HTML but about a specific regex problem. The HTML part is just for illustration. I use this in unit tests. I'm aware that this isn't robust enough for production use, but I'm the producer of that HTML, so I know what is coming. I won't bloat my code or dependencies just to parse HTML. I have good reasons to do it that way.

CodePudding user response:

The following code extracts everything between the main tags, whether or not the opening tag contains an attribute, and it matches only the main tag.

Using the re module:

import re

text = "<main attr='value'><foo>bar</foo></main>"

match = re.search(r"(?s)<main(?: [^>]*)?>(.*?)</main>", text)

if match:
    print(match.group(1))
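
To make the pieces of the pattern easier to read, here is a minimal sketch of the same idea written with re.VERBOSE. The helper name extract_main is hypothetical and only for illustration. It also uses \s instead of a literal space, which is an assumption on my side (a literal space would be ignored in verbose mode, and \s additionally accepts a newline or tab after the tag name).

```python
import re

# Same idea as the one-line pattern above, spelled out with comments.
MAIN_RE = re.compile(
    r"""
    <main            # literal tag name
    (?:\s[^>]*)?     # optional: one whitespace char, then anything except '>'
    >                # end of the opening tag
    (.*?)            # group 1: the content, as short as possible
    </main>          # closing tag
    """,
    re.DOTALL | re.VERBOSE,  # DOTALL is the flag form of the inline (?s)
)


def extract_main(html):
    """Return the text between <main ...> and </main>, or None if absent.

    Hypothetical helper for illustration; it is not an HTML parser.
    """
    m = MAIN_RE.search(html)
    return m.group(1) if m else None


if __name__ == "__main__":
    sample = '<main attr="value">\n<foo>bar</foo>\n</main>'
    # The captured group keeps the surrounding newlines, so strip them.
    print(extract_main(sample).strip())
```

Because the attribute part must start with whitespace, a tag such as <mainframe> is not mistaken for <main>. The non-greedy (.*?) stops at the first </main>, so nested main tags are not handled.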