How to get a <strong> tag as title and its child element as its description using Beautiful S

Time:11-08

For the HTML input below:

example = """<strong>First Title</strong><p>Content of first title</p><p>Content of first title</p><strong>Second title</strong><p>Content of second title</p></strong>"""

the output should be:

{'First Title': '<p>Content of first title</p> <p>Content of first title</p>', 'Second title': '<p>Content of second title</p>'}

and the code below produces exactly that:

from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(example, 'html.parser')

finalOutput = {}
for header in soup.find_all('strong'):
    title = header.get_text().strip()
    nextNode = header
    content = []
    while True:
        nextNode = nextNode.next_sibling

        if nextNode is None:
            # No more siblings: store what was collected for this title.
            finalOutput[title] = " ".join(content)
            break
        elif isinstance(nextNode, NavigableString):
            # Keep only non-empty text nodes.
            if nextNode.strip():
                content.append(nextNode.strip())
        elif isinstance(nextNode, Tag):
            if nextNode.name == "strong":
                # The next title starts here.
                finalOutput[title] = " ".join(content)
                break
            content.append(str(nextNode))

print(finalOutput)

The problem is that the real HTML wraps each title in a paragraph (<p><strong>...</strong></p>), and the Python code does not work for this kind of input:

example = """<p><strong>First Title</strong></p><p>Content of first title</p><p>Content of first title</p><p><strong>Second title</strong></p><p>Content of second title</p></strong>"""
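
A quick check, sketched here with a shortened version of the input (the variable name wrapped is only for illustration), suggests why the sibling walk collects nothing: each <strong> is now the only child of its <p>, so it has no siblings of its own.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the wrapped input.
wrapped = "<p><strong>First Title</strong></p><p>Content of first title</p>"
soup = BeautifulSoup(wrapped, "html.parser")
header = soup.find("strong")

# The <strong> is the only child of its <p>, so there is nothing after it.
print(header.next_sibling)

# The content paragraph is a sibling of the wrapping <p> instead.
print(header.parent.next_sibling)
```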

I want the output below: the text inside each <strong> should be the key, and the value should be the content up to the next <strong> tag.

Expected Output:

{'First Title': '<p>Content of first title</p> <p>Content of first title</p>', 'Second title': '<p>Content of second title</p>'}

CodePudding user response:

You need to select the <strong> tags that sit inside a <p>, and start walking the siblings from the parent <p> instead of from the <strong> itself:

for header in soup.select('p strong'):
    title = header.get_text().strip()
    nextNode = header.parent
    content = []
    ...

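and in the inner loop, check whether nextNode contains a <strong>:

...
elif isinstance(nextNode, Tag):
    # find() returns None for an empty <p>, where next(nextNode.children)
    # would raise StopIteration.
    if nextNode.find("strong") is not None:
        finalOutput[title] = " ".join(content)
        break
...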
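
Putting the two fragments together, a complete version might look like the sketch below. The function name sections_by_strong is hypothetical, and the stricter `p > strong` selector and `recursive=False` lookup are assumptions made here so that only a <strong> sitting directly inside a <p> counts as a title; the traversal idea itself is the one described above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

example = """<p><strong>First Title</strong></p><p>Content of first title</p><p>Content of first title</p><p><strong>Second title</strong></p><p>Content of second title</p></strong>"""


def sections_by_strong(html):
    """Map each <strong> title to the HTML that follows its wrapping <p>."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for header in soup.select("p > strong"):
        title = header.get_text().strip()
        content = []
        # Walk the siblings of the wrapping <p>, not of the <strong> itself.
        for node in header.parent.next_siblings:
            if isinstance(node, NavigableString):
                # Keep only non-empty text nodes.
                if node.strip():
                    content.append(node.strip())
            elif isinstance(node, Tag):
                # Stop at the next <p> that directly contains a <strong>.
                if node.find("strong", recursive=False) is not None:
                    break
                content.append(str(node))
        result[title] = " ".join(content)
    return result


final_output = sections_by_strong(example)
print(final_output)
```

Iterating over `next_siblings` replaces the manual `while True` loop, so the result for a title is stored in one place whether the walk ends at the next title or at the end of the document.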