I am trying to use Beautiful Soup to print the elements of a div. It is a bit hard to explain, so I have simplified it. Let me know if you need more clarification :) The div is structured as follows:
<div>
<div class="heading"></div>
<div class="info"></div>
<div class="heading"></div>
<div class="info"></div>
<div class="info"></div>
<div class="heading"></div>
<div class="info"></div>
</div>
I am trying to return a list of lists. Each inner list should contain a heading followed by the info divs up to the next heading. For example, it would look like this: [['heading', 'info'], ['heading', 'info', 'info']...]
As such, I tried to do this:
findAllDivs = container.find_all('div')
myList = []
for i in findAllDivs:
    if i['class'][0] == 'heading':
        try:
            if innerList:
                myList.append(innerList)
        except:
            pass
        innerList = []
        innerList.append(i)
    elif i['class'][0] == 'info':
        innerList.append(i)
This works, but the last heading and info list is missing from the result.
CodePudding user response:
Select all the headings, then iterate over each heading's find_next_siblings() and break as soon as a sibling does not have info in its class list:
data = []
for h in soup.div.select('.heading'):
    d = [h.text]
    for i in h.find_next_siblings():
        # stop at the first sibling that is not an info div
        if 'info' not in i.get('class', []):
            break
        d.append(i.text)
    data.append(d)
Example
from bs4 import BeautifulSoup
html = '''
<div>
<div class="heading">head1</div>
<div class="info">info1</div>
<div class="heading">head2</div>
<div class="info">info2.1</div>
<div class="info">info2.2</div>
<div class="heading">head3</div>
<div class="info">info3</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
data = []
for h in soup.div.select('.heading'):
    d = [h.text]
    for i in h.find_next_siblings():
        # stop at the first sibling that is not an info div
        if 'info' not in i.get('class', []):
            break
        d.append(i.text)
    data.append(d)
print(data)
Output
[['head1', 'info1'], ['head2', 'info2.1', 'info2.2'], ['head3', 'info3']]
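Alternative

The loop in the question only appends innerList when it reaches the next heading, so the final group is never added. If you would rather keep a single pass over the divs, a minimal sketch could flush the open group once more after the loop. The helper name group_by_heading is made up for illustration, and the sketch assumes the heading and info divs are direct children of the container and carry the classes heading and info:

```python
from bs4 import BeautifulSoup

html = '''
<div>
<div class="heading">head1</div>
<div class="info">info1</div>
<div class="heading">head2</div>
<div class="info">info2.1</div>
<div class="info">info2.2</div>
<div class="heading">head3</div>
<div class="info">info3</div>
</div>
'''


def group_by_heading(container):
    """Group each heading div with the info divs that follow it.

    Hypothetical helper: assumes the divs are direct children of
    `container` and use the classes 'heading' and 'info'.
    """
    groups = []
    current = None  # no group is open until the first heading is seen
    # recursive=False restricts the search to direct children
    for div in container.find_all('div', recursive=False):
        classes = div.get('class') or []  # divs without a class yield []
        if 'heading' in classes:
            if current:
                groups.append(current)  # close the previous group
            current = [div.get_text(strip=True)]
        elif 'info' in classes and current is not None:
            current.append(div.get_text(strip=True))
    # flush the last group; this is the step missing from the question's loop
    if current:
        groups.append(current)
    return groups


soup = BeautifulSoup(html, 'html.parser')
result = group_by_heading(soup.div)
print(result)
```

For the sample markup this should give the same three groups as the output above. Info divs that appear before the first heading are skipped rather than raising an error.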