Getting a specific part of an HTML document using BeautifulSoup in Python

I'm trying to find a way, using Python, to extract a specific piece of text from an HTML document when that text isn't wrapped in a tag of its own. The HTML looks like this:

<h1 >
    <small> Text1 </small>
    Text2
</h1>

I'm trying to get the part that says Text2 without getting the part that says Text1. I'm currently using Beautiful Soup in my Python code, so a solution using this library would be handy.

Thanks in advance for helping!

CodePudding user response：

Try combining soup.find with split. This works as long as Text1 is a single word:

from bs4 import BeautifulSoup


html = """<h1 >
<small> Text1 </small>
Text2
</h1>"""


soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
txt = soup.find('h1').text.split()[1]  # -> 'Text2'
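
A note on the split approach: indexing with [1] assumes Text1 is exactly one word. Here is a minimal sketch of an alternative, assuming it is acceptable to modify the parsed tree: drop the <small> element with decompose() and read what is left with get_text(strip=True). The multi-word label in the sample HTML is made up for illustration and is not from the question.

```python
from bs4 import BeautifulSoup

# Sample HTML; the multi-word <small> label is an illustrative assumption.
html = """<h1 >
<small> Text1 with spaces </small>
Text2
</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Remove the <small> element from the tree (this mutates the soup).
h1.small.decompose()

# What remains inside <h1> is only its own text.
txt = h1.get_text(strip=True)
print(txt)  # -> Text2
```

Because decompose() changes the tree, parse the document again (or work on a copy) if the <small> text is still needed later.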

CodePudding user response：

There are multiple ways to reach your goal, e.g. via contents:

soup.h1.contents[-1].strip()

or next_sibling:

soup.h1.small.next_sibling.strip()

or find_all(string=True, recursive=False) (string is the current name of the older text argument):

''.join(soup.h1.find_all(string=True, recursive=False)).strip()

and many more ...

Example

from bs4 import BeautifulSoup

html = '''
<h1 >
    <small> Text1 </small>
    Text2
</h1>
'''
soup = BeautifulSoup(html, 'html.parser')
    
print(soup.h1.contents[-1].strip())
print(soup.h1.small.next_sibling.strip())
print(''.join(soup.h1.find_all(string=True, recursive=False)).strip())  # string= replaces the older text= argument

Output

Text2
Text2
Text2
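
Note that contents[-1] assumes the wanted text is the last child of the <h1>, and next_sibling assumes it comes directly after the <small>. If the layout varies, or there are several <h1> elements, the same idea can be wrapped in a small helper. This is only a sketch: own_text is a hypothetical name, not part of Beautiful Soup, and the second <h1> in the sample is invented for illustration. It keeps the direct string children of a tag, skips HTML comments, and collapses whitespace.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def own_text(tag):
    """Return the text directly inside `tag`, ignoring child elements.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    parts = [
        str(child)
        for child in tag.children
        # Keep plain strings only; Comment is a NavigableString subclass.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(parts).split())


# Sample HTML; the second <h1> is an illustrative assumption.
html = """
<h1><small> Text1 </small> Text2 </h1>
<h1><small> Other label </small> Second   title </h1>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [own_text(h1) for h1 in soup.find_all("h1")]
print(titles)  # -> ['Text2', 'Second title']
```

Unlike the decompose() route, this leaves the parsed tree untouched, so the <small> text is still available afterwards.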