How do I extract HTML content between two elements (Python, BeautifulSoup)

I scraped a website and stored the text as HTML, so it contains only headers and paragraphs.

My HTML is structured like this:

<h2> header one </h2> 
<p> some text </p>
<p> some more text </p>
<h2> header two </h2>
.
.
<h2> header three </h2>

I need to get separate datasets looking like this:

dataset1:

<h2> header one </h2> 
<p> some text </p>
<p> some more text </p>

dataset2:

<h2> header two </h2> 
<p> some text </p>
<p> some more text </p>

I thought about converting the content to plain text and splitting it with a regex separator, but I cannot be sure that the text inside the header tags does not also appear inside the paragraph tags.

Is there any way to store the subsequent data from a given tag up until the next tag of the same type like this?

User response:

Not sure if this is the most efficient way, but you can get all the <h2> tags and iterate through them. For each one, iterate through its .next_siblings. When you hit the next <h2> tag, break out of that inner loop. If you get a <p> tag, append it to a list (or whatever you'd like).

So this will create a list of lists, where each element in the root is your partitioned dataset:

from bs4 import BeautifulSoup

html = '''<h2> header one </h2>
<p> some text </p>
<p> some more text </p>
<h2> header two </h2>
<p> some text2 </p>
<p> some more text2 </p>
<p> and some more text2 </p>
<h2> header three </h2>
<p> some text3 </p>'''

data = []
soup = BeautifulSoup(html, 'html.parser')
h2s = soup.find_all('h2')

for h2 in h2s:
    # Each dataset starts with its own header
    temp_data = [h2]
    for tag in h2.next_siblings:
        # Stop at the next header: it starts the next dataset
        if tag.name == 'h2':
            break
        # Keep paragraphs; whitespace text nodes (name is None) are skipped
        elif tag.name == 'p':
            temp_data.append(tag)

    data.append(temp_data)

Output:

for item in data:
    print(f'{item}')
[<h2> header one </h2>, <p> some text </p>, <p> some more text </p>]
[<h2> header two </h2>, <p> some text2 </p>, <p> some more text2 </p>, <p> and some more text2 </p>]
[<h2> header three </h2>, <p> some text3 </p>]
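
If each dataset is needed as an HTML string rather than a list of tags, the same sibling walk can be wrapped in a small helper. This is only a sketch: split_by_header and its header parameter are hypothetical names, not part of BeautifulSoup, and unlike the loop above it keeps every sibling tag (not only <p>) until the next header.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def split_by_header(html, header="h2"):
    """Return one HTML string per header: the header plus the tags up to the next one.

    Hypothetical helper for illustration; assumes the headers and their
    content are siblings at the same level of the document.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for head in soup.find_all(header):
        parts = [str(head)]
        for sibling in head.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other bare text nodes
            if sibling.name == header:
                break  # the next header starts the next section
            parts.append(str(sibling))
        sections.append("\n".join(parts))
    return sections


sample = "<h2>one</h2>\n<p>a</p>\n<p>b</p>\n<h2>two</h2>\n<p>c</p>"
sections = split_by_header(sample)
for section in sections:
    print(section)
    print("---")
```

Each string can then be written to its own file or parsed again on its own. If the headers are nested inside different parent elements, the sibling walk will not cross those boundaries, so the assumption about a flat structure matters.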