I'm trying to extract a specific piece of text from an HTML document using Python. The text I want isn't wrapped in a tag of its own. The HTML looks like this:
<h1 >
<small> Text1 </small>
Text2
</h1>
I'm trying to get the part that says Text2 without getting the part that says Text1. I'm currently using Beautiful Soup in my Python code, so a solution using this library would be handy.
Thanks in advance for helping!
CodePudding user response:
Try combining soup.find with split:
from bs4 import BeautifulSoup
html = """<h1 >
<small> Text1 </small>
Text2
</h1>"""
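# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning
soup = BeautifulSoup(html, 'html.parser')
# .text returns all text inside <h1>; split() breaks it on whitespace,
# so this assumes each piece of text is a single word
txt = soup.find('h1').text.split()[1] # -> 'Text2'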
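If the real text contains spaces, indexing the result of split() returns only one word of it. A minimal sketch of a different technique, assuming it is acceptable to modify the parsed tree: remove the <small> element first, then read whatever text is left in <h1>. The sample strings below are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with multi-word text (not the asker's real data)
html = """<h1 >
<small> First label </small>
Second piece of text
</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# decompose() removes the <small> tag and its contents from the tree
h1.small.decompose()

# get_text(strip=True) strips each remaining string and drops empty ones
direct_text = h1.get_text(strip=True)
print(direct_text)
```

Note that decompose() changes the soup in place, so do this on a copy if you still need the <small> text later.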
CodePudding user response:
There are multiple approaches to reach your goal, e.g. via contents:
soup.h1.contents[-1].strip()
or next_sibling:
soup.h1.small.next_sibling.strip()
or find_all(text=True, recursive=False):
''.join(soup.h1.find_all(text=True, recursive=False)).strip()
and many more ...
Example
from bs4 import BeautifulSoup
html = '''
<h1 >
<small> Text1 </small>
Text2
</h1>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.contents[-1].strip())
print(soup.h1.small.next_sibling.strip())
print(''.join(soup.h1.find_all(text=True, recursive=False)).strip())
Output
Text2
Text2
Text2
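The contents[-1] and next_sibling variants depend on where the text sits inside the tag. If the position can vary, one option is to collect every string that is a direct child of the tag. Below is a rough sketch of that idea; own_text is a hypothetical helper name, and the extra <em> element in the sample HTML is made up to show several text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Hypothetical helper: join the strings that are direct children of tag."""
    # Note: HTML comments are NavigableString subclasses and would be included too
    parts = [
        str(child).strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    ]
    # Drop whitespace-only pieces before joining
    return " ".join(part for part in parts if part)


# Illustrative HTML: the question's snippet plus one extra nested tag
html = "<h1 >\n<small> Text1 </small>\nText2\n<em>Text3</em>\nText4\n</h1>"
soup = BeautifulSoup(html, "html.parser")

result = own_text(soup.h1)
print(result)  # Text2 Text4
```

Text inside nested tags such as <small> or <em> is skipped because those children are Tag objects, not strings.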