I'm using BeautifulSoup to parse HTML and locate specific elements on a page.
Is there a way to rationalise the following attempt to pluck out a single element with a single find() call, based on both attributes of the target element and attributes of its ancestors?
HTML
<ul >
<li>Name: Mickey Mouse</li>
<li>Height: 3ft</li>
</ul>
<ul >
<li>Rating: 5</li>
<li>Score: 6</li>
</ul>
<ul >
<li>Age: 20</li>
<li>Appearances: 100</li>
</ul>
PYTHON
ancestors = soup.find_all("ul", class_="info")
for ancestor in ancestors:
    elem = ancestor.find("li", string=lambda s: s.startswith("Rating: "))
    if elem:
        break
In other words, can I add search conditions on ancestor elements to a find() call?
The question is a generic one about the capabilities of the find() method, not about the specific example given above, which is arbitrary.
Taking ancestral properties into account is possible using the select_one() method, which uses CSS selectors.
For example (ignoring the need to select by the prefixed text):
soup.select_one("ul.info li")
This will return the first <li> tag that has an ancestor <ul> tag with a class value of info. (To get all such tags, select() would be used instead.)
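As a minimal sketch of how select_one() and select() differ, assuming the <ul> elements carry class attributes (the class values, markup, and variable names below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class values are assumptions for illustration.
html = """
<ul class="info">
<li>Rating: 5</li>
<li>Score: 6</li>
</ul>
<ul class="other">
<li>Age: 20</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns only the first matching element (or None).
first = soup.select_one("ul.info li")

# select() returns every matching element as a list.
all_items = soup.select("ul.info li")

print(first)
print(all_items)
```

Here the descendant combinator (the space in "ul.info li") is what expresses the ancestor condition.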
Reading the documentation, I can't see an equivalent one-liner using the Beautiful Soup "pure" API that can do the same thing.
CodePudding user response:
You can do so, sort of, with find. I'm not quite sure what is meant by "pure" API, but let's get into it.
So, first, let's start with find. find has many capabilities. You can filter elements by tag name, by attribute values, by a regex on the tag name or attribute values, or even by content. You can also pass a function into find, and this is how you do more advanced matching with find, such as inspecting an element's parent.
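Before getting to the function-based filter, here is a rough sketch of the simpler filter kinds listed above. The markup, id values, and variable names are invented for illustration and are not taken from the question:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the class and id values are assumptions.
html = """
<ul class="info">
<li id="rating">Rating: 5</li>
<li id="score">Score: 6</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name.
by_name = soup.find("li")

# Filter by an attribute value.
by_attr = soup.find("li", id="score")

# Filter by a regex applied to an attribute value.
by_attr_regex = soup.find("li", id=re.compile(r"^rat"))

# Filter by a regex applied to the tag name.
by_name_regex = soup.find(re.compile(r"^u"))

# Filter by content, using the string argument.
by_content = soup.find("li", string=re.compile(r"^Score"))
```

None of these filters can look at an element's ancestors, which is why a custom function is needed for that part.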
from bs4 import BeautifulSoup
HTML = """
<ul >
<li>Name: Mickey Mouse</li>
<li>Height: 3ft</li>
</ul>
<ul >
<li>Rating: 5</li>
<li>Score: 6</li>
</ul>
<ul >
<li>Age: 20</li>
<li>Appearances: 100</li>
</ul>
"""
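def get_ratings(el):
    # Match an <li> whose text starts with "Rating: ".
    # el.string is None when a tag has several children, so guard against that.
    if el.name == 'li' and el.string and el.string.startswith("Rating: "):
        parent = el.parent
        # .get() avoids a KeyError when the parent has no class attribute.
        if parent.name == 'ul' and 'info' in parent.get('class', []):
            return True
    return False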
soup = BeautifulSoup(HTML, 'html.parser')
print(soup.find(get_ratings))
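The function above only inspects the direct parent. If the condition should apply to any ancestor, one possible variation is to use find_parent() inside the filter. This is a sketch only: the helper name rating_under_info, the <div> wrappers, and the class values are assumptions made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the wrappers and class values are assumptions.
html = """
<div class="other">
<ul>
<li>Rating: 1</li>
</ul>
</div>
<div class="info">
<ul>
<li>Rating: 5</li>
</ul>
</div>
"""


def rating_under_info(el):
    # Hypothetical helper: match an <li> whose text starts with "Rating: "
    # and that has a <div class="info"> somewhere among its ancestors.
    text = el.string
    if el.name != "li" or not text or not text.startswith("Rating: "):
        return False
    # find_parent() walks up the tree and returns None if nothing matches.
    return el.find_parent("div", class_="info") is not None


soup = BeautifulSoup(html, "html.parser")
match = soup.find(rating_under_info)
print(match)
```

The first <li> is skipped because its enclosing <div> has a different class, even though its own text matches.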
With that said, you can also do this with CSS selectors. We can't necessarily test for the prefix "Rating: ", but we can test whether the element contains "Rating: " with the custom CSS selector :-soup-contains():
from bs4 import BeautifulSoup
HTML = """
<ul >
<li>Name: Mickey Mouse</li>
<li>Height: 3ft</li>
</ul>
<ul >
<li>Rating: 5</li>
<li>Score: 6</li>
</ul>
<ul >
<li>Age: 20</li>
<li>Appearances: 100</li>
</ul>
"""
soup = BeautifulSoup(HTML, 'html.parser')
print(soup.select_one('ul.info li:-soup-contains("Rating: ")'))
Both will yield:
<li>Rating: 5</li>
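Because :-soup-contains() matches the text anywhere in the element, the two approaches can diverge on other input. A small sketch of that difference, with one way to enforce a true prefix match by filtering select() results in Python (the markup and variable names are assumptions, and a Soup Sieve version that supports :-soup-contains() is assumed to be installed):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class value and the "Old Rating" item are assumptions.
html = """
<ul class="info">
<li>Old Rating: 2</li>
<li>Rating: 5</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Substring match: the "Old Rating: 2" item also contains "Rating: ",
# and it comes first in the document.
loose = soup.select_one('ul.info li:-soup-contains("Rating: ")')

# Prefix match: let CSS handle the ancestor condition, then check the
# prefix in Python and take the first element that passes.
strict = next(
    (li for li in soup.select("ul.info li")
     if li.get_text().startswith("Rating: ")),
    None,
)

print(loose)
print(strict)
```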
Do you consider pre-writing your special logic in a get_ratings function and then using it in a one-liner sufficient? If not, then the answer is that there is no way, or at least no pretty one. You can definitely construct a one-liner that tests the element and its parents, but it would be a long, ugly one-liner, defeating the purpose of a one-liner. But you can encapsulate the logic you want in a function and pass it to find or find_all to make its usage a pretty one-liner.
Additionally, you can do this with select and select_one without additional functions. The choice is yours.
I'm still not sure what you mean by "pure" API, but technically both of these are pure API; one just requires you to write your own function and pass it in.