It looks quite easy, but I haven't managed to find a solution.
I tried other proposed solutions, such as span.clear(),
but they didn't work.
The page's structure:
<div class="details">
<h2>Public function</h2>
<div class="token">
<h2>Name person</h2>
<h3>Name person</h3>
<p>
<span>NO</span>NO</p>
<p>
<span>Time of Death:</span>13:38:00</p>
Result:
Time of Death: 13:38:00
Desired result:
13:38:00
My code:
whole_section = soup.find('div', {'class':"token"}) # Access to whole section
name_person = whole_section.h2.text # Select person's name, inside "h2" tag.
time_decease = whole_section.h3.next_sibling.next_sibling.next_sibling.next_sibling.text # Because there's no tag around the value, I had to use "next_sibling".
CodePudding user response:
I wouldn't recommend traversing the DOM by repeatedly getting the next sibling. In my experience, every extra hop makes the script more likely to break on the smallest change in the source HTML.
Instead, find the parent <p> you're after by using a lambda function that filters on the contents of the <p> itself (the 'Time of Death:' string, specifically); then loop through the <span> elements inside that <p> and remove them, so that only the text you're after is left:
html = '''<div class="details">
<h2>Public function</h2>
<div class="token">
<h2>Name person</h2>
<h3>Name person</h3>
<p>
<span>NO</span>NO</p>
<p>
<span>Time of Death:</span>13:38:00</p>
</div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
whole_section = soup.find('div', {'class':"token"}) # Access to whole section
name_person = whole_section.h2.text # Select person's name, inside "h2" tag.
time_decease = whole_section.find(lambda element: element.name == 'p' and 'Time of Death:' in element.text)
for span in time_decease.find_all('span'):
    span.decompose() # Remove the label <span> so only the bare time text remains.
print(name_person)
print(time_decease.text.strip()) # strip() drops the newline left over inside the <p>
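If you'd rather not modify the tree, here is a minimal sketch of a non-destructive variant: locate the label <span> by its exact text and read the text node that follows it. It assumes the value always sits directly after the label inside the same <p>, as in the question's markup; the names label and time_of_death are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div class="details">
<h2>Public function</h2>
<div class="token">
<h2>Name person</h2>
<h3>Name person</h3>
<p>
<span>NO</span>NO</p>
<p>
<span>Time of Death:</span>13:38:00</p>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
whole_section = soup.find('div', {'class': 'token'})

# Find the <span> whose text is exactly the label.
label = whole_section.find('span', string='Time of Death:')

# The value is the bare text node right after the label <span>.
# Assumption: the label exists and the value follows it directly;
# otherwise label or label.next_sibling would be None.
time_of_death = label.next_sibling.strip()

print(time_of_death)
```

Because nothing is decomposed, the <span> stays in the soup and can still be queried afterwards.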
CodePudding user response:
You can try this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(
"""
<div class="details">
<h2>Public function</h2>
<div class="token">
<h2>Name person</h2>
<h3>Name person</h3>
<p>
<span>NO</span>NO
</p>
<span title="Time of Death:">13:38:00</span>
</div>
</div>
""", "xml")
print(soup.select_one("span[title*=Time]").text)
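Note that the selector above relies on a title attribute, which the markup in the question does not have. As a rough sketch for the original markup (label in a <span>, value as bare text in the same <p>), one option is to collect every label/value pair with select() and stripped_strings. The fields dict and the "exactly two strings per <p>" rule are assumptions made for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup

html = """<div class="details">
<h2>Public function</h2>
<div class="token">
<h2>Name person</h2>
<h3>Name person</h3>
<p>
<span>NO</span>NO</p>
<p>
<span>Time of Death:</span>13:38:00</p>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

fields = {}
for p in soup.select("div.token p"):
    # stripped_strings yields each non-empty text fragment, whitespace trimmed.
    parts = list(p.stripped_strings)
    # Assumption: a usable <p> holds exactly a label and a value.
    if len(parts) == 2:
        label, value = parts
        fields[label.rstrip(":")] = value

print(fields["Time of Death"])
```

Building a dict this way avoids depending on the position of the <p>, and it keeps working if more label/value paragraphs are added to the section.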