How to extract deeply nested <p> tags using Beautiful Soup

Time:11-18

I have the content below and I am trying to understand how to extract the text of the <p> tags using Beautiful Soup (I am open to other methods). As you can see, the two <p> tags are not nested inside the same <div>. I gave it a shot with the following method, but that only seems to work when both <p> tags are within the same container.

<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>

CodePudding user response:

IIUC try

from bs4 import BeautifulSoup

html = """<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, 'lxml')
# find all the p tags whose parent is a div with class inside-panel-1
soup.select('div.inside-panel-1 > p')

[<p> I want to extract this copy</p>, <p>I want to extract this copy</p>]

If you want just the text, then try

p_tags = soup.select('div.inside-panel-1 > p')
[elm.text for elm in p_tags]
# -> [' I want to extract this copy', 'I want to extract this copy']
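
Since the question is about <p> tags that sit in different containers, it may help to see that nesting depth is not a problem in itself: find_all() searches all descendants by default. The following is only a minimal sketch, assuming the markup from the question and the built-in 'html.parser' backend (so no lxml install is assumed); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>"""

# 'html.parser' ships with Python; this choice is an assumption of the sketch.
soup = BeautifulSoup(html, "html.parser")

# Scope the search to the outer container first.
top_panel = soup.find("div", class_="top-panel")

# find_all() is recursive by default, so it reaches <p> tags at any depth,
# even when they live in different inner <div> containers.
paragraphs = top_panel.find_all("p")

# strip=True removes the leading space in the first paragraph.
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

Scoping to top-panel first keeps the search from picking up unrelated <p> tags elsewhere on a larger page.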

CodePudding user response:

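As the p tags are inside div[class="inside-panel-1"], you can select them with a CSS selector:

from bs4 import BeautifulSoup

html = """
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())

# select every <p> under a div with class inside-panel-1 inside top-panel
p_tags = soup.select('div.top-panel div[class="inside-panel-1"] p')
for p_tag in p_tags:
    print(p_tag.get_text(strip=True))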

Output:

I want to extract this copy
I want to extract this copy
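
The question mentions being open to other methods, so for comparison here is a rough sketch that uses only the standard library's html.parser module instead of Beautiful Soup. The class name ParagraphExtractor is hypothetical (it is not part of any library), and the sketch assumes <p> tags are not nested inside one another.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the text of every <p>, at any depth."""

    def __init__(self):
        super().__init__()
        self._in_p = False      # True while the parser is inside a <p>
        self._chunks = []       # text pieces of the current <p>
        self.paragraphs = []    # finished paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and trim surrounding whitespace.
            self.paragraphs.append("".join(self._chunks).strip())
            self._in_p = False


# Markup copied from the question.
html = """<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">Some Title</h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p> I want to extract this copy</p>
    </div>
    <div class="inside-panel-1">
      <p>I want to extract this copy</p>
    </div>
  </div>
</div>"""

extractor = ParagraphExtractor()
extractor.feed(html)
extractor.close()
print(extractor.paragraphs)
```

This avoids a third-party dependency, but it ignores class names entirely, so the Beautiful Soup selector is the better fit when only some <p> tags are wanted.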