Given the HTML input below:
example = """<strong>First Title</strong><p>Content of first title</p><p>Content of first title</p><strong>Second title</strong><p>Content of second title</p></strong>"""
the output should be:
{'First Title': '<p>Content of first title</p> <p>Content of first title</p>', 'Second title': '<p>Content of second title</p>'}
and the code below produces exactly that:
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(example, 'html.parser')
finalOutput = {}
for header in soup.find_all('strong'):
    title = header.get_text().strip()
    nextNode = header
    content = []
    while True:
        nextNode = nextNode.next_sibling
        if nextNode is None:
            # No siblings left: store what was collected for this title.
            finalOutput[title] = " ".join(content)
            break
        elif isinstance(nextNode, NavigableString):
            # Keep only non-blank text nodes.
            if nextNode.strip():
                content.append(nextNode.strip())
        elif isinstance(nextNode, Tag):
            if nextNode.name == "strong":
                # The next title starts here.
                finalOutput[title] = " ".join(content)
                break
            content.append(str(nextNode))
print(finalOutput)
The problem is that in the real HTML each <strong> is wrapped in its own <p>, and the Python code does not work for an example like this:
example = """<p><strong>First Title</strong></p><p>Content of first title</p><p>Content of first title</p><p><strong>Second title</strong></p><p>Content of second title</p></strong>"""
So I want output like the one below: the text inside each <strong> should be the key, and the value should be everything before the next <strong> tag.
Expected Output:
{'First Title': '<p>Content of first title</p> <p>Content of first title</p>', 'Second title': '<p>Content of second title</p>'}
Answer:
You need to select the <strong> tags that sit inside a <p> and start walking from the parent <p>:
for header in soup.select('p strong'):
    title = header.get_text().strip()
    nextNode = header.parent
    content = []
    ...
and in the inner loop, check whether the first child of nextNode is a <strong>:
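        ...
        elif isinstance(nextNode, Tag):
            # Default to None so an empty tag does not raise StopIteration.
            firstChild = next(nextNode.children, None)
            if getattr(firstChild, "name", None) == "strong":
                finalOutput[title] = " ".join(content)
                break
        ...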
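Putting the two fragments together, a complete version might look like the sketch below. It is not the exact code from the answer: the function name `sections_by_strong` is made up for illustration, the `while True` loop is replaced by iterating over `next_siblings`, and the stop condition uses `find("strong", recursive=False)` rather than looking only at the first child. It assumes every title is a `<strong>` that is a direct child of a `<p>`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def sections_by_strong(html):
    """Map each <p><strong>title</strong></p> to the markup that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for header in soup.select("p > strong"):
        title = header.get_text().strip()
        content = []
        # Walk the siblings of the wrapping <p>, not of <strong> itself.
        for node in header.parent.next_siblings:
            if isinstance(node, NavigableString):
                text = node.strip()
                if text:
                    content.append(text)
            elif isinstance(node, Tag):
                # Stop at the next <p> that directly wraps a <strong> title.
                is_title = node.name == "p" and node.find("strong", recursive=False) is not None
                if is_title:
                    break
                content.append(str(node))
        result[title] = " ".join(content)
    return result


example = """<p><strong>First Title</strong></p><p>Content of first title</p><p>Content of first title</p><p><strong>Second title</strong></p><p>Content of second title</p></strong>"""
print(sections_by_strong(example))
```

Because the loop starts from `header.parent`, the siblings it sees are the content paragraphs rather than the (empty) siblings of `<strong>` inside its own `<p>`, which is why the original code collected nothing for this input.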