I have some text that contains HTML tags, something like:
text = '<p>Some text</p> <h1>Some text</h1> ....'
soup = BeautifulSoup(text)
I parsed this text using BeautifulSoup. I would like to extract every sentence with its corresponding text and tag. I tried:
for sent in soup:
    print(sent.text)  # ok
    print(sent.tag)   # not ok, since NavigableString does not have a tag attribute
I also tried soup.find_all() and got stuck at the same point: I have access to the text but not to the original tag.
CodePudding user response:
Instead of tag, use name to get the element's tag name:
for tag in soup.find_all():
    print(tag.text, tag.name)
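Pass 'html.parser' as the second argument to avoid the behavior of lxml, which BeautifulSoup picks by default when it is installed. lxml slightly reshapes the structure and wraps partial HTML in <html> and <body>, so find_all() would also return those two tags.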
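To see the difference the parser makes, here is a rough sketch that lists the tag names found by each parser. The variable names are just for illustration, and the lxml part only runs if lxml happens to be installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<p>Some text</p><h1>Some text</h1>'

# The built-in parser keeps the fragment as it is.
builtin_names = [tag.name for tag in BeautifulSoup(html, 'html.parser').find_all()]
print(builtin_names)  # ['p', 'h1']

# lxml wraps the fragment in <html> and <body>, so those names show up too.
try:
    lxml_names = [tag.name for tag in BeautifulSoup(html, 'lxml').find_all()]
    print(lxml_names)
except FeatureNotFound:
    print('lxml is not installed, skipping the comparison')
```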
Example
from bs4 import BeautifulSoup
html = '''<p>Some text</p><h1>Some text</h1>'''
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all():
    print(tag.text, tag.name)
Output
Some text p
Some text h1
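If you would rather keep the loop from the question (iterating over soup directly), note that it also yields bare strings, such as the whitespace between tags, and those are NavigableString objects with no tag name of their own. A minimal sketch, assuming you only care about the top-level elements, is to keep the Tag objects and skip everything else. The nested <b> in the sample HTML is my own addition, to show that the text of child tags is included:

```python
from bs4 import BeautifulSoup, Tag

html = '<p>Some text</p> <h1>Some <b>bold</b> text</h1>'
soup = BeautifulSoup(html, 'html.parser')

# Iterating over the soup yields its direct children: Tag objects and
# NavigableString objects (here, the space between </p> and <h1>).
# Only Tag objects have a meaningful name, so filter on the type.
pairs = [(node.name, node.get_text()) for node in soup if isinstance(node, Tag)]
print(pairs)  # [('p', 'Some text'), ('h1', 'Some bold text')]
```

Unlike find_all(), this does not descend into nested tags, so <b> does not get its own entry here.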