I am trying to get all the <p> tags that come after an <h2>.
I know how to do this when there is only one <p> after the <h2>, but not when there are multiple <p> tags.
Here's an example of the webpage:
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
....
I need to get all paragraphs in relation to their headings, e.g. Paragraphs 1 and 2 that are related to Heading Text1.
I'm trying to do this using BeautifulSoup with Python. I have been at it for days, googling as well, without success.
How can this be done?
CodePudding user response:
This is how I would do it: get all the h2 and p tags in document order and iterate through them, remembering the text of the last h2 seen and attaching each following paragraph to it.
from bs4 import BeautifulSoup
html = '''
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
soup = BeautifulSoup(html, 'html.parser')
dict_to_save = {}
# find all the 'h2' and 'p' tags
for tag in soup(['h2', 'p']):
    # if it is an 'h2' tag, remember its text as the current header
    if tag.name == 'h2':
        header = tag.text.strip()
    # otherwise it is a 'p' tag: add the paragraph to the last header seen
    else:
        dict_to_save[header] = dict_to_save.get(header, []) + [tag.text.strip()]
print(dict_to_save)
{'Heading Text1': ['Paragraph1', 'Paragraph2'],
'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
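The loop above assumes every <p> comes after some <h2>; otherwise header is not defined yet when the first paragraph is reached. A minimal sketch of the same idea that tolerates this case, assuming it is acceptable to collect such paragraphs under a placeholder key (the function name group_paragraphs, the no_heading_key parameter and the extra intro paragraph in the sample are made up for illustration):

```python
from collections import defaultdict

from bs4 import BeautifulSoup

html = '''
<p>Intro paragraph</p>
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
'''


def group_paragraphs(html, no_heading_key=None):
    """Group <p> texts under the text of the closest preceding <h2>."""
    soup = BeautifulSoup(html, 'html.parser')
    grouped = defaultdict(list)
    # paragraphs seen before any heading go under this placeholder key
    header = no_heading_key
    # find_all returns the tags in document order
    for tag in soup.find_all(['h2', 'p']):
        text = tag.get_text(strip=True)
        if tag.name == 'h2':
            header = text
            grouped[header]  # register the heading even if no <p> follows
        else:
            grouped[header].append(text)
    return dict(grouped)


result = group_paragraphs(html)
print(result)
```

Note that two headings with identical text end up sharing one key in this sketch, just as in the loop above.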
CodePudding user response:
You could reach your goal with a dict and .find_previous(): iterate over all <p> tags, find the previous <h2> of each one and use its text as the key in your dict, then simply append the paragraph text to that key's list:
d = {}
for p in soup.select('p'):
    # create the list the first time a heading is seen
    if d.get(p.find_previous('h2').text) is None:
        d[p.find_previous('h2').text] = []
    d[p.find_previous('h2').text].append(p.text)
Example
from bs4 import BeautifulSoup
html = '''
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
soup = BeautifulSoup(html, 'html.parser')
d = {}
for p in soup.select('p'):
    # create the list the first time a heading is seen
    if d.get(p.find_previous('h2').text) is None:
        d[p.find_previous('h2').text] = []
    d[p.find_previous('h2').text].append(p.text)
print(d)
Output
{'Heading Text1': ['Paragraph1', 'Paragraph2'],
'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
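For comparison, here is a rough sketch that goes the other way round: start from each <h2> and walk forward through its siblings until the next <h2>. It assumes the headings and paragraphs are siblings under the same parent, as in the sample markup; the function name paragraphs_by_heading is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''


def paragraphs_by_heading(html):
    """Map each <h2> text to the <p> texts that follow it as siblings."""
    soup = BeautifulSoup(html, 'html.parser')
    sections = {}
    for h2 in soup.find_all('h2'):
        paragraphs = []
        # True matches every tag, so text nodes between tags are skipped
        for sibling in h2.find_next_siblings(True):
            if sibling.name == 'h2':
                break  # the next section starts here
            if sibling.name == 'p':
                paragraphs.append(sibling.get_text(strip=True))
        sections[h2.get_text(strip=True)] = paragraphs
    return sections


sections = paragraphs_by_heading(html)
print(sections)
```

Unlike the .find_previous() version, this one also keeps headings that have no paragraphs (with an empty list), but it misses paragraphs nested inside other elements such as a <div>.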