Finding text from html using BeautifulSoup

Time:05-25

I have the following .html:

<li class="print text">
                            <span><em >
                                    <div >1.29 s</div>
                                </em><em >passed</em>This is the text I want to get</span>

I need to get only the text that is outside all of the other tags (text is: This is the text I want to get).

I was trying to use this piece of code:

for el in doc.find_all('li', attrs={'class': 'print text'}):
    print(el.get_text())

But unfortunately it prints everything, including the text inside the em tags.

Is there any way to do this?

Thank you!!

CodePudding user response:

Find the specific li tag by its class, call find_all on it to collect the em tags, take the last one with indexing, and read its next_sibling, which is the text node you want:

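from bs4 import BeautifulSoup

html = """<li class="print text">
        <span><em >
                <div >1.29 s</div>
            </em><em >passed</em>This is the text I want to get</span>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# The last <em> inside the <li> is followed directly by the wanted text node
print(soup.find("li", class_="print text").find_all("em")[-1].next_sibling)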
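One caveat with next_sibling: it can be None, another tag, or a whitespace-only string if the markup differs slightly from the sample. Here is a minimal sketch of a more defensive version, assuming the same markup as in the question; `trailing_text` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<li class="print text">
        <span><em>
                <div>1.29 s</div>
            </em><em>passed</em>This is the text I want to get</span></li>"""


def trailing_text(li):
    """Return the text right after the last <em> inside li, or None."""
    ems = li.find_all("em")
    if not ems:
        return None
    sibling = ems[-1].next_sibling
    # next_sibling may be None, a Tag, or a string; accept only strings
    if isinstance(sibling, NavigableString):
        return sibling.strip() or None
    return None


soup = BeautifulSoup(html, "html.parser")
li = soup.find("li", class_="print text")
result = trailing_text(li) if li is not None else None
print(result)
```

If the li or the em tags are missing, this returns None instead of raising an AttributeError or IndexError.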

CodePudding user response:

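You could call find(text=True, recursive=False) on the span. With recursive=False only the span's direct children are searched, so text nested inside the em tags is skipped. (Since Beautiful Soup 4.4.0 the argument is also available as string; text is the older name.)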

Example
from bs4 import BeautifulSoup

html = '''<li class="print text">
        <span><em >
                <div >1.29 s</div>
            </em><em >passed</em>This is the text I want to get</span>'''

soup = BeautifulSoup(html, 'html.parser')

# Search only the direct children of <span> for a text node
print(soup.find('li', class_='print text').span.find(text=True, recursive=False))

Output

This is the text I want to get

If there are multiple span elements in your li, you could select them all and loop over them:

from bs4 import BeautifulSoup

html = '''<li class="print text">
        <span><em >
                <div >1.29 s</div>
            </em><em >passed</em>This is the text I want to get</span>
            <span><em >
                <div >1.50 s</div>
            </em><em >passed</em>This is the text I want to get too</span>'''

soup = BeautifulSoup(html, 'html.parser')

# li.print.text matches an <li> that carries both classes
for e in soup.select('li.print.text span'):
    print(e.find(text=True, recursive=False))

Output

This is the text I want to get
This is the text I want to get too
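Note that find returns only the first direct text node. If the span's own text could be split around another tag, something like the following sketch might help. It walks the direct children and joins every string it finds. The markup here is made up for illustration (the b tag is not in the question), and `direct_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup: the span's own text is split around a <b> tag
html = '''<li class="print text">
<span><em><div>1.29 s</div></em><em>passed</em>first part <b>bold</b> second part</span>
</li>'''


def direct_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    parts = [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    ]
    # Drop whitespace-only nodes that become empty after stripping
    return " ".join(p for p in parts if p)


soup = BeautifulSoup(html, "html.parser")
texts = [direct_text(span) for span in soup.select("li.print.text span")]
print(texts)
```

The text inside em and b is left out because only the span's immediate children are inspected.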