How to extract all <p> with its corresponding <h2>?

Time:09-18

I am trying to get all the <p> that come after <h2>.

I know how to do this in case I have only one <p> after <h2>, but not in case I have multiple <p>.

Here's an example of the webpage:

<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
....

I need to get all paragraphs in relation to their headings, e.g. Paragraphs 1 and 2 that are related to Heading Text1.

I've been trying to do this with BeautifulSoup in Python for days, and searching online hasn't helped.

How can this be done?

CodePudding user response:

This is how I would do it: get all the h2 and p tags, iterate through them in document order, remember the text of the last h2 seen, and attach each following paragraph to it.

from bs4 import BeautifulSoup

html = '''
<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''

soup = BeautifulSoup(html, 'html.parser')

dict_to_save = {}

# find all the 'h2' and 'p' tags
for tag in soup(['h2','p']):
    # if 'h2' tag save it into a variable named header
    if tag.name == 'h2':
        header = tag.text.strip()

    # if not 'h2' tag add this paragraph to the last header
    else:
        dict_to_save[header] = dict_to_save.get(header, []) + [tag.text.strip()]

print(dict_to_save)

Output

{'Heading Text1': ['Paragraph1', 'Paragraph2'],
 'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
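
One caveat with the loop above: if a <p> shows up before the first <h2>, `header` has not been assigned yet and the loop raises a NameError. Below is a minimal sketch of a guarded variant. It assumes such leading paragraphs should simply be skipped and that headings without paragraphs should still appear with an empty list; the function name `group_paragraphs` is made up for illustration.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def group_paragraphs(html):
    """Map each <h2> text to the list of <p> texts that follow it."""
    soup = BeautifulSoup(html, 'html.parser')
    grouped = defaultdict(list)
    header = None  # no heading seen yet

    # soup([...]) is shorthand for soup.find_all([...]); tags arrive in document order
    for tag in soup(['h2', 'p']):
        if tag.name == 'h2':
            header = tag.get_text(strip=True)
            grouped[header]  # register the heading even if no paragraph follows
        elif header is not None:
            grouped[header].append(tag.get_text(strip=True))
        # assumption: paragraphs before the first <h2> are skipped

    return dict(grouped)
```

Calling `group_paragraphs(html)` on the sample page from the question should give the same dictionary as the original loop.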

CodePudding user response:

You could reach your goal with a dict and .find_previous(): iterate over all <p>, find the previous <h2> for each one, use its text as the key in your dict, then simply append the paragraph text to that key's list:

d = {}
for p in soup.select('p'):
    # text of the closest <h2> that precedes this paragraph
    heading = p.find_previous('h2').text
    # create the list the first time a heading is seen, then append
    d.setdefault(heading, []).append(p.text)

Example

from bs4 import BeautifulSoup

html = '''
<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
soup = BeautifulSoup(html, 'html.parser')

d = {}
for p in soup.select('p'):
    # text of the closest <h2> that precedes this paragraph
    heading = p.find_previous('h2').text
    # create the list the first time a heading is seen, then append
    d.setdefault(heading, []).append(p.text)
print(d)

Output

{'Heading Text1': ['Paragraph1', 'Paragraph2'],
 'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
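
A third option, not taken from either answer, is to start from each <h2> and walk forward through its siblings until the next <h2>. The sketch below assumes the headings and their paragraphs share the same parent element, as in the sample page; `paragraphs_by_heading` is a hypothetical name. Unlike the .find_previous() version, paragraphs nested inside another tag (for example a <div>) are not siblings of the heading and are left out.

```python
from bs4 import BeautifulSoup


def paragraphs_by_heading(html):
    """Collect the <p> siblings that follow each <h2>, up to the next <h2>."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for h2 in soup.find_all('h2'):
        texts = []
        # find_next_siblings() with no arguments yields the following sibling tags
        for sibling in h2.find_next_siblings():
            if sibling.name == 'h2':
                break  # the next section starts here
            if sibling.name == 'p':
                texts.append(sibling.get_text(strip=True))
        # note: two headings with identical text would share one key
        result[h2.get_text(strip=True)] = texts
    return result
```

This keeps each section's boundaries explicit, at the cost of rescanning siblings for every heading, which should be fine for ordinary pages.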