I have the following .html:
<li class="print text">
<span><em>
<div>1.29 s</div>
</em><em>passed</em>This is the text I want to get</span>
I need only the text that sits directly in the span, outside all of the nested tags (that is: This is the text I want to get).
I was trying to use this piece of code:
for el in doc.find_all('li', attrs={'class': 'print text'}):
    print(el.get_text())
But unfortunately it prints everything, including the contents of the em tags.
Is there any way to do this?
Thank you!!
CodePudding user response:
Find the specific li tag by its class, call find_all on it to collect the em tags, take the last one from the list by indexing, and read its next_sibling attribute, which returns the text that follows it.
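from bs4 import BeautifulSoup

html = """<li class="print text">
<span><em>
<div>1.29 s</div>
</em><em>passed</em>This is the text I want to get</span>"""
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# The last <em> is followed directly by the text node we want
print(soup.find("li", class_="print text").find_all("em")[-1].next_sibling)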
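The one-line chain above fails if any step comes back empty: find returns None when no li matches, and indexing with [-1] raises IndexError when there are no em tags. A minimal defensive sketch of the same idea follows; the helper name text_after_last_em is hypothetical, and the class value "print text" is taken from the question.

```python
from bs4 import BeautifulSoup, NavigableString


def text_after_last_em(soup):
    """Return the text right after the last <em> in the target <li>, or None."""
    li = soup.find("li", class_="print text")
    if li is None:
        return None  # no matching <li> in the document
    ems = li.find_all("em")
    if not ems:
        return None  # the <li> contains no <em> tags
    sibling = ems[-1].next_sibling
    # next_sibling may be None (nothing follows) or another tag instead of text
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return None


html = (
    '<li class="print text"><span><em><div>1.29 s</div></em>'
    "<em>passed</em>This is the text I want to get</span></li>"
)
print(text_after_last_em(BeautifulSoup(html, "html.parser")))
```

Returning None instead of raising keeps a scraping loop running when one list item has a different layout.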
CodePudding user response:
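You could use find(text=True, recursive=False) on the span. It returns the first text node that is a direct child of the span and does not descend into the nested tags. Since Beautiful Soup 4.4.0 the same argument is also available as string=True.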
Example
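from bs4 import BeautifulSoup

html = '''<li class="print text">
<span><em>
<div>1.29 s</div>
</em><em>passed</em>This is the text I want to get</span>'''
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# recursive=False limits the search to direct children of the span
print(soup.find('li', class_='print text').span.find(text=True, recursive=False))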
Output
This is the text I want to get
If there are multiple span elements in your li, you could go with:
from bs4 import BeautifulSoup

html = '''<li class="print text">
<span><em>
<div>1.29 s</div>
</em><em>passed</em>This is the text I want to get</span>
<span><em>
<div>1.50 s</div>
</em><em>passed</em>This is the text I want to get too</span>'''
soup = BeautifulSoup(html, 'html.parser')
# Select every span inside an li that carries both classes
for e in soup.select('li.print.text span'):
    print(e.find(text=True, recursive=False))
Output
This is the text I want to get
This is the text I want to get too
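Note that find with recursive=False returns only the first direct text node, so if the span starts with whitespace or has text in several places, the result may not be the one you expect. As a rough sketch, assuming you want every piece of text that sits directly in the span, you could join all direct text children instead. The helper name own_text is hypothetical, and the sample markup is adapted from the question.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def own_text(tag):
    """Join the text nodes that are direct children of `tag`, ignoring nested tags."""
    parts = []
    for child in tag.children:
        # Keep plain text nodes only; skip tags and HTML comments
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = str(child).strip()
            if stripped:
                parts.append(stripped)
    return " ".join(parts)


html = """<li class="print text">
<span>
<em><div>1.29 s</div></em><em>passed</em>This is the text I want to get</span>
<span><em><div>1.50 s</div></em><em>passed</em>This is the text I want to get too</span>
</li>"""

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li", class_="print text")
for span in li.find_all("span"):
    print(own_text(span))
```

Here the first span begins with a newline before the first em, which a plain find(text=True, recursive=False) would return instead of the wanted sentence.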