I'm using BeautifulSoup and have found an element in my document like so:
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
and I'd like to extract
Hershey's<sup>®</sup> makes yummy chocolate
I know I can take this item, grab its .contents, and then re-join every piece that isn't an <a>, but that seems like a super hacky way to do it. How else might I get this text? Methods like get_text() return the text, but without the <sup> tags, which I'd like to preserve.
CodePudding user response:
You can use next_siblings on the last <a> element:
from bs4 import BeautifulSoup
html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")
print(
"".join(str(x) for x in soup.find("a", id="_Toc374204469").next_siblings)
)
Output:
Hershey's<sup>®</sup> makes yummy chocolate
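If you'd rather not hard-code the id of the last anchor, another option is to drop the <a> tags from the paragraph and then serialize what remains. This is only a sketch: it assumes the anchors are empty and can be thrown away, and note that decompose() modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Remove every <a> inside the paragraph (assumed to be empty anchors).
for a in p.find_all("a"):
    a.decompose()

# decode_contents() renders the children as markup, without the <p> wrapper.
inner_html = p.decode_contents()
print(inner_html)
```

This should print the same string as the next_siblings version above.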
CodePudding user response:
The best solution I've found thus far is to use the bleach package. With that, I can just do
import bleach
# my_html must be a string; tags lists what to keep, strip=True drops the rest
bleach.clean(my_html, tags=['sup'], strip=True)
This wasn't working for me at first because my HTML was a BeautifulSoup Tag object, and bleach expects a string. So I called str() on the Tag to get its HTML representation and fed that to bleach.
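Putting those two steps together, here is a minimal sketch. It assumes bleach is installed; the names tag and cleaned are just ones I picked for illustration. I pass tags as a set because, as far as I know, newer bleach releases expect a set while older ones accept any container.

```python
import bleach
from bs4 import BeautifulSoup

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("p")  # a bs4 Tag, not a string

# bleach.clean() expects text, so serialize the Tag with str() first.
# tags names the elements to keep; strip=True drops disallowed tags
# instead of escaping them.
cleaned = bleach.clean(str(tag), tags={"sup"}, strip=True)
print(cleaned)
```

The <p> and <a> tags should be gone from the result, while <sup> is kept.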