I have a soup of this format:
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three (it could be anywhere from 1 to 5).
How do I go about dividing this soup to get the paragraphs? Regex seemed decent at first, but it didn't work for me because I would still need a soup object afterwards for further extraction.
Thanks a ton
CodePudding user response:
You could select your element, iterate over its siblings, and break as soon as a sibling is not a p:
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Or go the other way around, which is closer to your initial question: select the <div class = 'bar'> and call find_previous_siblings('p'), which returns the matching siblings nearest first (reverse document order):
for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)
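A minimal sketch of the second variant, in case the order matters: find_previous_siblings yields the closest sibling first, so reversing the result gives document order. The numbered paragraph text and the variable names nearest_first and document_order are made up for illustration and are not part of the original question.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the digits are only there
# to make the order of the results visible.
html = """
<div class='foo'>
<table></table>
<p>1</p>
<p>2</p>
<p>3</p>
<div class='bar'>
<p>4</p>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
bar = soup.select_one(".bar")

# Siblings come back nearest first, i.e. in reverse document order.
nearest_first = [p.get_text() for p in bar.find_previous_siblings("p")]

# Reverse the list to read the paragraphs top to bottom.
document_order = nearest_first[::-1]

print(nearest_first)
print(document_order)
```

Note that this collects every preceding p sibling of the bar div, so it assumes there are no paragraphs before the table.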
Example
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
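# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# Walk the tags that follow the table and stop at the first non-<p>
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)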
Output
<p> </p>
<p> </p>
<p> </p>
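If you need this in more than one place, the loop above could be wrapped in a small helper. This is only a sketch: paragraphs_between and make_html are hypothetical names introduced here, and the generated markup is an assumption modelled on the question. It uses itertools.takewhile in place of the explicit break and checks that a varying number of paragraphs is handled.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def paragraphs_between(start):
    """Return the run of consecutive <p> tag siblings right after `start`."""
    # find_next_siblings() without arguments returns tags only, so
    # whitespace text nodes between the elements are skipped.
    return list(takewhile(lambda t: t.name == "p", start.find_next_siblings()))


def make_html(n):
    """Build test markup with n paragraphs between the table and .bar."""
    paragraphs = "".join(f"<p>{i}</p>" for i in range(1, n + 1))
    return (
        "<div class='foo'><table></table>"
        f"{paragraphs}"
        "<div class='bar'><p>x</p></div></div>"
    )


for n in (1, 3, 5):
    soup = BeautifulSoup(make_html(n), "html.parser")
    texts = [p.get_text() for p in paragraphs_between(soup.div.table)]
    print(n, texts)
```

The returned items are ordinary Tag objects, so you can keep calling find, select, or get_text on them for further extraction.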
CodePudding user response:
Try:
html='''
<div class = 'foo'>
<table> </table>
<p>1</p>
<p>2</p>
<p>3</p>
<div class = 'bar'>
<p>4</p>
.
.
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
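# Direct <p> children of .foo only, so the count is not hard-coded
# (assumes no <p> siblings follow the .bar div)
p = [x.get_text() for x in soup.select('.foo > p')]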
print(p)
Output:
['1', '2', '3']