I scraped a website and stored the text as HTML, so it contains only headers and paragraphs.
My HTML is structured like this:
<h2> header one </h2>
<p> some text </p>
<p> some more text </p>
<h2> header two </h2>
.
.
<h2> header three </h2>
I need to get separate datasets looking like this:
dataset1:
<h2> header one </h2>
<p> some text </p>
<p> some more text </p>
dataset2:
<h2> header two </h2>
<p> some text </p>
<p> some more text </p>
I thought about converting the content to plain text and splitting it with a regex separator, but I cannot be sure that the text inside the header tags does not also appear inside the paragraph tags.
Is there a way to collect everything that follows a given tag, up until the next tag of the same type?
CodePudding user response:
I'm not sure this is the most efficient way, but you can get all the <h2> tags and iterate through them. For each one, iterate through its .next_siblings. When you hit the next <h2> tag, break out of that loop. If you get a <p> tag, append it to a list (or whatever you'd like).
This creates a list of lists, where each element of the outer list is one partitioned dataset:
html = '''<h2> header one </h2>
<p> some text </p>
<p> some more text </p>
<h2> header two </h2>
<p> some text2 </p>
<p> some more text2 </p>
<p> and some more text2 </p>
<h2> header three </h2>
<p> some text3 </p>'''
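from bs4 import BeautifulSoup

data = []
soup = BeautifulSoup(html, 'html.parser')
h2s = soup.find_all('h2')
for h2 in h2s:
    # Each dataset starts with its own header
    temp_data = [h2]
    for tag in h2.next_siblings:
        # Stop as soon as the next header begins
        if tag.name == 'h2':
            break
        # Keep paragraphs; whitespace strings between tags are skipped
        elif tag.name == 'p':
            temp_data.append(tag)
    data.append(temp_data)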
Output:
for item in data:
    print(f'{item}')
[<h2> header one </h2>, <p> some text </p>, <p> some more text </p>]
[<h2> header two </h2>, <p> some text2 </p>, <p> some more text2 </p>, <p> and some more text2 </p>]
[<h2> header three </h2>, <p> some text3 </p>]
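If you want each dataset back as an HTML string, as in the question, rather than as a list of tag objects, the same sibling walk can be wrapped in a small function. This is only a sketch: the name split_by_header and its header and keep parameters are made up for illustration and are not part of BeautifulSoup, and it assumes the headers and paragraphs are siblings at the same level, as in the sample HTML.

```python
from bs4 import BeautifulSoup


def split_by_header(html, header="h2", keep=("p",)):
    """Hypothetical helper: return one HTML string per header section."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for head in soup.find_all(header):
        parts = [str(head)]
        for sibling in head.next_siblings:
            # Text nodes between tags have no tag name, so default to None
            name = getattr(sibling, "name", None)
            if name == header:
                break  # the next section starts here
            if name in keep:
                parts.append(str(sibling))
        sections.append("\n".join(parts))
    return sections


html = """<h2> header one </h2>
<p> some text </p>
<p> some more text </p>
<h2> header two </h2>
<p> some text2 </p>"""

for i, section in enumerate(split_by_header(html), start=1):
    print(f"dataset{i}:")
    print(section)
```

Because the split follows the parsed tag structure rather than the text, it does not matter whether a header's text also appears inside a paragraph.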