Trying to scrape parts of Tables of Contents (toc), ideally with SoupSieve:
<html>
<dl>
#many rows
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
#many more rows
</dl>
</html>
Note that the items I need are NOT children/descendants of 4.2.1. Drivers; they only look like they are because of the numbering.
The elements I need to scrape are those between the elements Drivers and Constraints. It's not always 3 of them: there may be 0, 3, or 5, depending on the page. Later in my code I use pandas to output these elements into individual cells in a .csv.
I've tried things like this:
def get_drivers():
    data.append({
        'url': url,
        'type': 'driver',
        'list': [x.get_text(strip=True) for x in toc.select('dd:-soup-contains-own("Drivers") ~ dd')]
    })
... but this just gives me all the elements from Drivers to the end of the document — often dozens of elements that I don't need.
Question: how can I get selectors to start selecting after Drivers and stop selecting at Constraints?
CodePudding user response:
You can absolutely do this with CSS selectors. Use :-soup-contains and :not, along with the general sibling combinator (~) and the type selector (dd), to filter out what comes after each marker (i.e. subtract Constraints onwards from Drivers onwards):
from bs4 import BeautifulSoup as bs
html = '''<html>
<dl>
#many rows
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
#many more rows
</dl>
</html>'''
soup = bs(html, 'lxml')
filtered = [i.text for i in soup.select(
    'dd:-soup-contains(" Drivers") ~ dd:not(dd:-soup-contains(" Constraints"), dd:-soup-contains(" Constraints") ~ dd)')]
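The same "between two markers" slice can also be expressed without a compound :not selector, by pairing find_next_siblings with itertools.takewhile. This is a minimal sketch against the example markup from the question (using the stdlib html.parser rather than lxml, so nothing extra needs to be installed):

```python
from itertools import takewhile
from bs4 import BeautifulSoup

html = '''<dl>
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
</dl>'''

soup = BeautifulSoup(html, 'html.parser')
# Anchor on the Drivers row, then keep taking following <dd> siblings
# until the Constraints row appears.
start = soup.select_one('dd:-soup-contains(" Drivers")')
items = [d.get_text(strip=True)
         for d in takewhile(lambda d: 'Constraints' not in d.text,
                            start.find_next_siblings('dd'))]
print(items)  # ['4.2.1.1. itemD1', '4.2.1.2. itemD2', '4.2.1.3. itemD3']
```

Because takewhile stops at the first non-matching sibling, this naturally yields an empty list when Constraints immediately follows Drivers, matching the "may be 0" case.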
I guess a loop might work as well, though it's less preferable IMO:
from bs4 import BeautifulSoup as bs
html = '''<html>
<dl>
#many rows
<dd>4.2.1. Drivers</dd>
<dd>4.2.1.1. itemD1</dd>
<dd>4.2.1.2. itemD2</dd>
<dd>4.2.1.3. itemD3</dd>
<dd>4.2.2. Constraints</dd>
<dd>4.2.2.1. itemC1</dd>
#many more rows
</dl>
<dl>
<dd>Error</dd>
</dl>
</html>'''
soup = bs(html, 'lxml')
# Anchor on the Drivers <dd> itself (there is no nested dd to descend into),
# then walk forward one <dd> at a time until Constraints or the end.
start = soup.select_one('dd:-soup-contains(" Drivers")')
next_node = start.find_next('dd')
while True:
    if not next_node:
        break
    if 'Constraints' in next_node.text:
        break
    print(next_node.text)
    next_node = next_node.find_next('dd')
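Since the question mentions writing the scraped items out via pandas, here is a hedged sketch of one way to get each item into its own .csv cell — the url value and the rows structure are assumptions standing in for the asker's surrounding scrape loop; only explode/to_csv are the actual pandas calls:

```python
import pandas as pd

# Hypothetical scrape results: one dict per page, mirroring the
# data.append({...}) shape from the question.
rows = [{'url': 'https://example.com/toc',  # placeholder URL
         'type': 'driver',
         'list': ['4.2.1.1. itemD1', '4.2.1.2. itemD2', '4.2.1.3. itemD3']}]

df = pd.DataFrame(rows)
# explode() turns the list column into one row per item, so each
# scraped element lands in its own cell of the .csv.
df = df.explode('list')
csv_text = df.to_csv(index=False)
print(csv_text)
```

Calling df.to_csv('drivers.csv', index=False) instead would write the file directly; passing no path returns the CSV as a string, which is handy for inspection.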