This is about Python's re module. Given the example below, I would like to extract everything between the main tags.
<main attr="value">
<foo>bar</foo>
</main>
The expected output:
<foo>bar</foo>
My problem while building the regex pattern is the attribute in the opening tag:
<main attr="value">
     ^^^^^^^^^^^^^
I'm not sure how to express this with regex. Without the attribute (<main>), this regex works:
(?s)<main>(.*?)<\/main>
I assume the solution involves .* but I couldn't get it to work. How can I ignore the string between <main and > in the first line?
This question is not about HTML but about a specific regex problem. The HTML part is just for illustration. I use this in unit tests. I'm aware that this isn't robust enough for production use. But I'm the producer of that HTML, so I know what is coming. I won't bloat my code or dependencies just to parse HTML. I have good reasons to do it that way.
CodePudding user response:
The following code extracts everything between the main tags, whether or not the opening tag has attributes, and matches only the main tag.
Using the re library:
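import re

text = "<main attr='value'><foo>bar</foo></main>"
# (?s) lets . match newlines; (?: [^>]*)? optionally skips the attributes;
# (.*?) lazily captures everything up to the first </main>.
match = re.search(r"(?s)<main(?: [^>]*)?>(.*?)</main>", text)
if match:
    print(match.group(1))  # <foo>bar</foo>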
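
As a rough check, the same idea can be run against the multi-line input from the question. This is only a sketch: it assumes the optional attribute group is written as (?: [^>]*)? and the capture as the lazy (.*?), and it uses re.DOTALL in place of the inline (?s) flag. The names MAIN_RE, extract_main, and the three sample strings are made up for illustration.

```python
import re

# Hypothetical names for illustration. re.DOTALL is equivalent to the inline (?s).
MAIN_RE = re.compile(r"<main(?: [^>]*)?>(.*?)</main>", re.DOTALL)


def extract_main(text):
    """Return the text between <main ...> and </main>, or None if absent."""
    match = MAIN_RE.search(text)
    return match.group(1) if match else None


with_attr = '<main attr="value">\n<foo>bar</foo>\n</main>'
without_attr = "<main>\n<foo>bar</foo>\n</main>"
other_tag = "<mainframe>\n<foo>bar</foo>\n</mainframe>"

print(repr(extract_main(with_attr)))     # '\n<foo>bar</foo>\n'
print(repr(extract_main(without_attr)))  # '\n<foo>bar</foo>\n'
print(extract_main(other_tag))           # None
```

Because the optional group has to start with a space, a tag such as <mainframe> is not mistaken for <main>. The captured text keeps the newlines that surround the inner element, so calling .strip() on the result gives exactly the expected output from the question.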