I want to remove the unwanted sub-level duplicate tags using lxml etree


This is the input sample text. I want to do an object-based cleanup to avoid hierarchy issues.

<p><b><b><i><b><i><b>

<i>sample text</i>

</b></i></b></i></b></b></p>

Required Output

<p><b><i>sample text</i></b></p>

CodePudding user response:

The markup itself provides enough structure to extract the elements inside.

Using re in Python, you can extract the tags and the text, then recombine them.

For example:

import re


html = """<p><b><b><i><b><i><b>

<i>sample text</i>

</b></i></b></i></b></b></p>"""


regex_object = re.compile("\<(.*?)\>")
html_objects = regex_object.findall(html)
set_html = []

for obj in html_objects:
    if obj[0] != "/" and obj not in set_html:
        set_html.append(obj)


regex_text = re.compile("\>(.*?)\<")
text = [result for result in regex_text.findall(html) if result][0]

# Recombine
result = ""
for obj in set_html:
    result  = f"<{obj}>"
result  = text
for obj in set_html[::-1]:
    result  = f"</{obj}>"
    
# result = '<p><b><i>sample text</i></b></p>'
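
The question mentions lxml specifically, so here is a minimal object-based sketch of the same cleanup using lxml.html (my own assumption, not part of the regex approach above). The idea is to unwrap an element with drop_tag() whenever the same tag already appears on one of its ancestors; drop_tag() removes the element itself but keeps its text and children in place.

from lxml import html

source = """<p><b><b><i><b><i><b>

<i>sample text</i>

</b></i></b></i></b></b></p>"""

root = html.fragment_fromstring(source)

# Walk a static copy of the descendant list, because drop_tag() mutates the tree.
for element in list(root.iterdescendants()):
    # An element is redundant if an ancestor already carries the same tag.
    if any(ancestor.tag == element.tag for ancestor in element.iterancestors()):
        element.drop_tag()

print(html.tostring(root).decode())
# roughly <p><b><i>sample text</i></b></p>, with the original newlines kept inside <i>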

CodePudding user response:

You can use Python's re module to write a function that repeatedly matches an opening tag, its corresponding closing tag, and everything in between. Storing the tags as dictionary keys removes duplicates while preserving the order they were found in (if order isn't important, just use a set). Once all pairs of tags have been stripped, wrap what's left with the keys of the dictionary in reverse order.

import re

def remove_duplicates(string):
    
    tags = {}
    while (match := re.findall(r'<(.+)>([\w\W]*)</\1>', string)):
        tag, string = match[0][0], match[0][1]   # match is [(group0, group1)]
        tags.update({tag: None})

    for tag in reversed(tags):
        string = f'<{tag}>{string}</{tag}>'

    return string

Note: I've used [\w\W]* as a cheat to match everything, including newlines, which . alone would not match without the re.DOTALL flag.
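
For reference, a quick usage sketch calling the function on the sample markup from the question:

html = """<p><b><b><i><b><i><b>

<i>sample text</i>

</b></i></b></i></b></b></p>"""

print(remove_duplicates(html))
# <p><b><i>sample text</i></b></p>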
