I have to scrape a div that has 2 tags inside it, <img> and <sup>; I need the content of

Sample html code:

<div>
Hello everyone how are you
<sup>Hello hi</sup>
<figure>Blah Blah<img /></figure>
</div>

I tried using the decompose() function in BeautifulSoup, but it also destroys the <sup> tag. Can anyone help me out?

CodePudding user response:

To get text of the <sup> tag:

from bs4 import BeautifulSoup


html_doc = """\
<div>
Hello everyone how are you
<sup>Hello hi</sup>
<figure>Blah Blah<img /></figure>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.sup.text)

Prints:

Hello hi

To remove the <img /> tag:

soup.img.extract()
print(soup.div)

Prints:

<div>
Hello everyone how are you
<sup>Hello hi</sup>
<figure>Blah Blah</figure>
</div>
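
If the goal is to drop only the <img> and keep everything else, decompose() can be called on the <img> tag itself rather than on one of its parents. A minimal sketch, assuming the same sample HTML as above (the names sample_html, demo_soup, sup_text and div_text are just for illustration):

```python
from bs4 import BeautifulSoup

sample_html = """\
<div>
Hello everyone how are you
<sup>Hello hi</sup>
<figure>Blah Blah<img /></figure>
</div>"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

# decompose() destroys the tag it is called on together with its children,
# so calling it on <img> (not on <div> or <figure>) leaves <sup> untouched.
demo_soup.img.decompose()

# Text of the <sup> tag only.
sup_text = demo_soup.sup.get_text(strip=True)

# All remaining text in the <div>, joined with single spaces.
div_text = demo_soup.div.get_text(" ", strip=True)

print(sup_text)
print(div_text)
```

The difference from extract() is that extract() returns the removed tag so it can still be used, while decompose() discards it.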

CodePudding user response:

from bs4 import BeautifulSoup
html_doc = """\
<div>
Hello everyone how are you
<sup>Hello hi</sup>
<figure>Blah Blah<img /></figure>
</div>"""

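soup = BeautifulSoup(html_doc, "lxml")  # the lxml parser must be installed
div = soup.find("div")                  # first <div> in the document
sup_text = div.find("sup").text         # text of the <sup> inside that <div>

print(sup_text)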

Sorry if something isn't right; I'm on my phone and can't test it. You also need to run pip install lxml and, in place of file.html, supply your own file or the website.
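
If the <div> or the <sup> might be missing on some pages, find() returns None and calling .text on it raises an AttributeError. A rough sketch that guards against that; get_sup_text is a made-up helper name, and html.parser is used here only so that nothing extra has to be installed:

```python
from bs4 import BeautifulSoup


def get_sup_text(html):
    """Return the text of the first <sup> inside the first <div>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div")
    if div is None:
        # No <div> at all in this document.
        return None
    sup = div.find("sup")
    if sup is None:
        # The <div> exists but contains no <sup>.
        return None
    return sup.get_text(strip=True)


print(get_sup_text("<div>Hello <sup>Hello hi</sup></div>"))
print(get_sup_text("<p>no div here</p>"))
```

Returning None instead of raising is just one possible choice; raising a clearer error may suit a scraper better.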