How to get text and corresponding tag with BeautifulSoup?

Time:02-01

I have some text that contains HTML tags, something like:

text = '<p>Some text</p> <h1>Some text</h1> ....'
soup = BeautifulSoup(text)

I parsed this text using BeautifulSoup. I would like to extract each piece of text together with its corresponding tag. I tried:

for sent in soup:
    print(sent.text)  # ok
    print(sent.tag)   # not ok: NavigableString does not have a tag attribute

I also tried soup.find_all() and got stuck at the same point: I have access to the text but not to the original tag.

CodePudding user response:

Instead of tag, use the name attribute to get the element's tag name:

for tag in soup.find_all():
    print(tag.text, tag.name)

Pass 'html.parser' explicitly as the parser. If no parser is specified, BeautifulSoup picks the best one installed, typically lxml, which slightly reshapes the structure and wraps partial HTML in <html> and <body>.

Example

from bs4 import BeautifulSoup

html = '''<p>Some text</p><h1>Some text</h1>'''
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all():
    print(tag.text, tag.name)

Output

Some text p
Some text h1
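
Note that tag.text on a parent tag also includes the text of its nested children, so with nested markup find_all() prints the same text more than once. A minimal sketch of one possible alternative, assuming nested sample HTML that is not part of the question: collect the text nodes with find_all(string=True) and read each node's parent.name. The variable name pairs is only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a nested <b> tag (not from the question)
html = '<p>Some text</p><h1>Some <b>bold</b> text</h1>'
soup = BeautifulSoup(html, 'html.parser')

pairs = []
# find_all(string=True) returns the NavigableString text nodes;
# each node's .parent is the Tag that directly contains it
for node in soup.find_all(string=True):
    text = node.strip()
    if text:  # skip whitespace-only nodes
        pairs.append((text, node.parent.name))

for text, name in pairs:
    print(text, name)
```

Each text fragment is paired only with the tag that directly encloses it, so "bold" is reported under b rather than repeated under h1.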