Selectors to scrape lists between two elements with defined words

Time:12-01

I'm trying to scrape parts of a table of contents (TOC), ideally with SoupSieve:

<html>
<dl>
#many rows
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
#many more rows
</dl>
</html>

Note that the items I need are NOT children or descendants of 4.2.1. Drivers; they only look like they are because of the numbering.

The elements I need to scrape are those between the Drivers and Constraints elements. There aren't always 3 of them: there may be 0, 3, or 5, depending on the document. Later on in my code I use pandas to output these elements into individual cells in a .csv file.

I've tried things like this:

def get_drivers():
    data.append({
        'url': url,
        'type': 'driver',
        'list': [x.get_text(strip=True) for x in toc.select('dd:-soup-contains-own("Drivers") ~ dd')]
    })

... but this just gives me all the elements from Drivers to the end of the document, often dozens of elements that I don't need.

Question: how can I get selectors to start selecting after Drivers and stop selecting at Constraints?

Answer:

You can absolutely do this with CSS selectors. Use :-soup-contains and :not, along with the general sibling combinator (~) and a type selector (dd), to filter out what comes after each heading (i.e. subtract everything from Constraints onwards from everything after Drivers):

from bs4 import BeautifulSoup as bs

html = '''<html>
<dl>
#many rows
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
#many more rows
</dl>
</html>'''
soup = bs(html, 'lxml')
# Every dd after Drivers, minus the Constraints dd itself and every dd after it
filtered = [i.text for i in soup.select(
    'dd:-soup-contains(" Drivers") ~ dd:not(dd:-soup-contains(" Constraints"), dd:-soup-contains(" Constraints") ~ dd)')]
print(filtered)
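To reuse the selector for other pairs of headings, it could be wrapped in a small helper. This is only a sketch: the name items_between and its parameters are made up for illustration, it assumes the headings and the items are all sibling dd elements inside the same dl, and it uses the built-in html.parser instead of lxml so nothing extra needs installing.

```python
from bs4 import BeautifulSoup


def items_between(soup, start_word, stop_word):
    """Hypothetical helper: texts of the dd siblings after start_word and before stop_word."""
    # The words are pasted into the selector as-is, so they must not contain double quotes.
    selector = (
        f'dd:-soup-contains("{start_word}") ~ '
        f'dd:not(dd:-soup-contains("{stop_word}"), '
        f'dd:-soup-contains("{stop_word}") ~ dd)'
    )
    return [dd.get_text(strip=True) for dd in soup.select(selector)]


html = '''<dl>
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
</dl>'''
soup = BeautifulSoup(html, 'html.parser')
drivers = items_between(soup, 'Drivers', 'Constraints')
print(drivers)
```

When Drivers is immediately followed by Constraints, the helper returns an empty list, which covers the "0 items" case from the question. If the stop heading is missing altogether, nothing is subtracted and every dd after the start heading comes back.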

A loop would work as well, though it is less preferable IMO:

from bs4 import BeautifulSoup as bs

html = '''<html>
<dl>
#many rows
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
#many more rows
</dl>
<dl>
<dd>Error</dd>
</dl>
</html>'''
soup = bs(html, 'lxml')
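# "+" is the adjacent sibling combinator: the first dd right after the Drivers heading
start = soup.select_one('dd:-soup-contains(" Drivers") + dd')
next_node = start

# Walk forward through the dd elements until the Constraints heading (or the end of the document)
while next_node and 'Constraints' not in next_node.text:
    print(next_node.text)
    next_node = next_node.find_next('dd')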
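
One thing to watch with find_next: it walks the whole document in order, so if the Constraints heading is ever missing it carries on into any later dl. A sibling-based variant avoids that. This is a rough sketch rather than the approach above: siblings_until is a made-up name, itertools.takewhile is swapped in for the while loop, and html.parser is used only to avoid the lxml dependency.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def siblings_until(soup, start_word, stop_word):
    """Hypothetical helper: dd siblings after the start heading, up to the stop heading."""
    start = soup.select_one(f'dd:-soup-contains("{start_word}")')
    if start is None:
        return []
    # find_next_siblings only returns later dd elements that share the same parent
    following = start.find_next_siblings('dd')
    # Keep taking items until the first dd whose text contains the stop word
    wanted = takewhile(lambda dd: stop_word not in dd.get_text(), following)
    return [dd.get_text(strip=True) for dd in wanted]


# No Constraints heading here, and a second dl that should not be picked up
html = '''<dl>
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
</dl>
<dl>
<dd>Error</dd>
</dl>'''
soup = BeautifulSoup(html, 'html.parser')
result = siblings_until(soup, 'Drivers', 'Constraints')
print(result)
```

Because only siblings are considered, the Error entry in the second dl is never reached, even though the stop heading is absent.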