get all soup above a certain div

Time:06-28

I have a soup of this format:

<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>

I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three paragraphs (there could be anywhere from 1 to 5).

How do I go about dividing this soup to get the paragraphs? Regex seemed reasonable at first, but it didn't work for me because I would still need a soup object afterwards for further extraction.

Thanks a ton

CodePudding user response:

You could select your element, iterate over its siblings, and break at the first sibling that is not a p:

for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)

Or, the other way around and closer to your initial question, select the <div class = 'bar'> and call find_previous_siblings('p'):

for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)

Example:

from bs4 import BeautifulSoup

html='''
<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)

Output:

<p> </p>
<p> </p>
<p> </p>
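
A note on the second variant: find_previous_siblings walks backwards from the anchor, so the paragraphs come out nearest-first. The sketch below reverses the result to restore document order. It assumes a cleaned-up version of the sample markup in which the bar div is closed and the paragraphs carry text; neither is in the original question.

```python
from bs4 import BeautifulSoup

# Assumed, well-formed variant of the sample markup (text added for clarity).
html = '''
<div class='foo'>
  <table> </table>
  <p>1</p>
  <p>2</p>
  <p>3</p>
  <div class='bar'><p>4</p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

bar = soup.select_one('.bar')

# find_previous_siblings yields the closest sibling first ('3', '2', '1'),
# so reverse the list to get the paragraphs in document order.
paragraphs = list(reversed(bar.find_previous_siblings('p')))

texts = [p.get_text() for p in paragraphs]
print(texts)  # ['1', '2', '3']
```

Each item in paragraphs is still a Tag, so calls such as find or select keep working on it for further extraction.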

CodePudding user response:

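Try the following. Note that the slice [0:3] hard-codes three paragraphs, so it only fits this exact sample: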

html='''
<div class = 'foo'>
  <table> </table>
  <p>1</p>
  <p>2</p>
  <p>3</p>
  <div class = 'bar'>
  <p>4</p>
  .
  .
</div>
        
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

p = [x.get_text() for x in soup.select('.foo p')[0:3]]
print(p)

Output:

['1', '2', '3']
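
Since the question says the paragraph count varies, here is a minimal sketch that avoids the fixed slice by combining find_next_siblings with itertools.takewhile. The helper name paragraphs_after_table and the sample markup (with a closed bar div and a trailing paragraph) are assumptions made for illustration, not part of the original posts.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def paragraphs_after_table(container):
    """Return the run of <p> tags that directly follows the first <table>.

    Hypothetical helper: it stops at the first sibling tag that is not a <p>,
    so the number of paragraphs does not need to be known in advance.
    """
    table = container.find('table')
    if table is None:
        return []
    # find_next_siblings() with no arguments yields only tags, in document order.
    return list(takewhile(lambda tag: tag.name == 'p', table.find_next_siblings()))


# Assumed sample: two paragraphs before the bar div, one inside it, one after it.
html = (
    "<div class='foo'><table></table><p>1</p><p>2</p>"
    "<div class='bar'><p>3</p></div><p>4</p></div>"
)
soup = BeautifulSoup(html, 'html.parser')
foo = soup.select_one('.foo')

result = paragraphs_after_table(foo)
print([p.get_text() for p in result])  # ['1', '2']
```

The returned items are Tag objects rather than strings, so they can still be searched afterwards, which was the reason for ruling out regex in the question.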