Return element in div until classname changes bs4

Time:05-31

I am trying to use Beautiful Soup to print the elements of a div. It is a bit hard to explain, so I have simplified it. Let me know if you need more clarification :) The div is structured as such:

<div>
    <div class="heading"></div>
    <div class="info"></div>
    <div class="heading"></div>
    <div class="info"></div>
    <div class="info"></div>
    <div class="heading"></div>
    <div class="info"></div>
</div>

I am trying to return a list of lists. Each inner list should contain a heading followed by the info divs that come before the next heading. For example, it would look like this: [['heading', 'info'], ['heading', 'info', 'info']...]

As such, I tried to do this:

findAllDivs = container.find_all('div')

myList = []
for i in findAllDivs:

    if i['class'][0] == 'heading':
        
        try:
            if innerList:
                myList.append(innerList)
        except:
            pass

        innerList = []
        innerList.append(i)

    elif i['class'][0] == 'info':
        innerList.append(i)

This works, but it does not return the last heading/info list.

CodePudding user response:

Select all the headings, iterate over each one's find_next_siblings(), and break as soon as a sibling does not have info in its class list:

data = []

for h in soup.div.select('.heading'):
    d = [h.text]
    for i in h.find_next_siblings():
        # stop at the first sibling that is not an info div
        if 'info' not in i.get('class', []):
            break
        d.append(i.text)
    data.append(d)
Example
from bs4 import BeautifulSoup

html = '''
<div>
    <div class="heading">head1</div>
    <div class="info">info1</div>
    <div class="heading">head2</div>
    <div class="info">info2.1</div>
    <div class="info">info2.2</div>
    <div class="heading">head3</div>
    <div class="info">info3</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

data = []

for h in soup.div.select('.heading'):
    d = [h.text]
    for i in h.find_next_siblings():
        if 'info' not in i.get('class', []):
            break
        d.append(i.text)
    data.append(d)

print(data)
Output
[['head1', 'info1'], ['head2', 'info2.1', 'info2.2'], ['head3', 'info3']]
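
As a side note, the loop from the question can also be kept with a small change. It drops the last group because the current inner list is only appended when the next heading is encountered, and nothing follows the final heading. A minimal sketch, assuming the divs carry the class names heading and info as in the example above; the function name group_by_heading is hypothetical and only used for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <div class="heading">head1</div>
    <div class="info">info1</div>
    <div class="heading">head2</div>
    <div class="info">info2.1</div>
    <div class="info">info2.2</div>
    <div class="heading">head3</div>
    <div class="info">info3</div>
</div>
'''


def group_by_heading(container):
    """Group the direct child divs of container into [heading, info, ...] lists."""
    groups = []
    current = None  # group being built; None until the first heading is seen
    for div in container.find_all('div', recursive=False):
        classes = div.get('class', [])  # empty list if the div has no class
        if 'heading' in classes:
            if current:
                groups.append(current)  # close the previous group
            current = [div.get_text(strip=True)]
        elif 'info' in classes and current is not None:
            current.append(div.get_text(strip=True))
    if current:
        groups.append(current)  # flush the last group after the loop ends
    return groups


soup = BeautifulSoup(html, 'html.parser')
result = group_by_heading(soup.div)
print(result)
```

The only real difference from the original attempt is the final append after the loop; initialising the current group to None also removes the need for the try/except.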