I'm using bs4 to scrape a document with a format like the one below, and I only want the a tag elements above text2. How can I do that?
<h1>text1</h1>
<a href="link">link</a>
<h1>text2</h1>
<a href="link"></a>
If I convert the soup to a string and split it, I'm not sure I can turn the result back into a soup, and I need to call soup.find_all('a') afterwards.
CodePudding user response:
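Try find_all_previous(). Call it on the text2 heading rather than on soup, and it returns the elements that come before that heading: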
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<h1>text1</h1>
<a href="link">link</a>
<h1>text2</h1>
<a href="link"></a>""", "html.parser")
# Passing "a" keeps only the links; with no argument, every earlier tag (including the h1) is returned.
print(soup.find("h1", string="text2").find_all_previous("a"))
[<a href="link">link</a>]
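If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: links_before_heading is a made-up name, not part of bs4, and it assumes the heading is an h1 whose whole text matches exactly. Two things to be aware of: find() returns None when nothing matches, and find_all_previous() walks backwards through the document, so the results come out in reverse order.

```python
from bs4 import BeautifulSoup

HTML = """
<h1>text1</h1>
<a href="link">link</a>
<h1>text2</h1>
<a href="link"></a>"""


def links_before_heading(soup, heading_text):
    """Return the <a> tags that appear before the <h1> whose text is heading_text.

    Hypothetical helper for illustration; not a bs4 function.
    """
    heading = soup.find("h1", string=heading_text)
    if heading is None:
        # find() returns None when no heading matches, so return an empty list
        return []
    # find_all_previous() yields matches in reverse document order,
    # so reverse them to get top-to-bottom order
    return list(reversed(heading.find_all_previous("a")))


soup = BeautifulSoup(HTML, "html.parser")
links = links_before_heading(soup, "text2")
print([a["href"] for a in links])
```

For the sample document this returns only the one link above text2; the empty a tag below text2 is left out.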