BeautifulSoup: Find an element based on its own attributes and those of an ancestor

Time:01-12

I'm using BeautifulSoup to parse HTML and locate specific elements on a page.

Is there a way to rationalise the following attempt, so that a single find() call plucks out one element based on both the attributes of the target element and the attributes of its ancestors?

HTML

<ul >
  <li>Name: Mickey Mouse</li>
  <li>Height: 3ft</li>
</ul>
<ul class="info">
  <li>Rating: 5</li>
  <li>Score: 6</li>
</ul>
<ul >
  <li>Age: 20</li>
  <li>Appearances: 100</li>
</ul>

PYTHON

ancestors = soup.find_all("ul", class_="info")

for ancestor in ancestors:
    # Guard against None: a tag with nested children has no single .string
    elem = ancestor.find("li", string=lambda s: s and s.startswith("Rating: "))
    if elem: break

In other words, can I add search conditions of ancestral elements in a find() call?

The question is a generic one about the capabilities of the find() method, not about the specific example given above which is arbitrary.

Taking ancestral properties into account is possible using the select_one() method - which uses CSS selectors.

For example (ignoring the need to select by the prefixed text):

soup.select_one("ul.info li")

This will return the first <li> tag that has an ancestor <ul> tag with a class value of info. (select() with the same selector would return all of them.)

Reading the documentation, I can't see an equivalent one-liner using the Beautiful Soup "pure" API that can do the same thing.

CodePudding user response:

You can do so, sort of, with find. I'm not quite sure what is being suggested by "pure" API, but let's get into it.

So, first, let's start with find. Find has many capabilities. You can filter elements by tag name, by attribute values, by a regex on the tag name or on attribute values, or even by content. You can also pass a function into find, and within find that is the only way to express more advanced conditions, such as a test on an element's parent.
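As a rough illustration of those filter types, here is a minimal sketch. The markup, the id value and the variable names are made up for this example and are not part of the question.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup; the id and class values are illustrative only.
html = """
<ul class="info">
  <li id="rating">Rating: 5</li>
  <li>Score: 6</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find("li")                                # tag name
by_attr = soup.find("ul", class_="info")                 # attribute value
by_name_regex = soup.find(re.compile(r"^u"))             # regex on tag name
by_attr_regex = soup.find("li", id=re.compile(r"^rat"))  # regex on attribute
by_text = soup.find("li", string=re.compile(r"^Score"))  # content
# A function receives each tag and returns True for a match.
by_func = soup.find(lambda tag: tag.name == "li" and tag.get("id") == "rating")
```

Only the function form can look outside the element itself, which is why the answer to the question goes that way.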

from bs4 import BeautifulSoup

HTML = """
<ul >
  <li>Name: Mickey Mouse</li>
  <li>Height: 3ft</li>
</ul>
<ul class="info">
  <li>Rating: 5</li>
  <li>Score: 6</li>
</ul>
<ul >
  <li>Age: 20</li>
  <li>Appearances: 100</li>
</ul>
"""

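def get_ratings(el):
    # el.string is None for tags with nested children, so check it first
    if el.name == 'li' and el.string and el.string.startswith("Rating: "):
        parent = el.parent
        # .get() avoids a KeyError when the parent has no class attribute
        if parent.name == 'ul' and 'info' in parent.get('class', []):
            return True
    return False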


soup = BeautifulSoup(HTML, 'html.parser')

print(soup.find(get_ratings))

With that said, you can also do this with CSS selectors. We can't necessarily test for the prefix "Rating: ", but we can test whether the element contains "Rating: " with the custom CSS selector :-soup-contains():

from bs4 import BeautifulSoup

HTML = """
<ul >
  <li>Name: Mickey Mouse</li>
  <li>Height: 3ft</li>
</ul>
<ul class="info">
  <li>Rating: 5</li>
  <li>Score: 6</li>
</ul>
<ul >
  <li>Age: 20</li>
  <li>Appearances: 100</li>
</ul>
"""

soup = BeautifulSoup(HTML, 'html.parser')

print(soup.select_one('ul.info li:-soup-contains("Rating: ")'))

Both will yield:

<li>Rating: 5</li>
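
If a true prefix match matters, one option is to let the selector handle the ancestor condition and do the prefix test in plain Python. This is only a sketch; the extra "Overall Rating" item is invented here to show where contains and startswith differ.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the first item is added to show the difference.
html = """
<ul class="info">
  <li>Overall Rating: 4</li>
  <li>Rating: 5</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains() matches both items, since each contains "Rating: ".
contains = soup.select('ul.info li:-soup-contains("Rating: ")')

# The selector checks the ancestor; Python narrows to a real prefix match.
prefixed = next(
    (li for li in soup.select("ul.info li")
     if li.get_text().startswith("Rating: ")),
    None,
)
```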

Do you consider pre-writing your special logic in the get_ratings function and then using it in a one-liner sufficient? If not, then the answer is that there is no way, at least not a pretty one. You can definitely construct a one-liner that tests the element and its parents, but it would be long and ugly, defeating the purpose of a one-liner. But you can encapsulate the logic you want in a function and pass it to find or find_all, which makes the call itself a pretty one-liner.
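
To make that encapsulation reusable, a small factory can build the predicate for you. The sketch below is one possible shape: li_with_prefix_under is a hypothetical helper name, and it uses find_parent() so that any ancestor qualifies rather than only the direct parent. The markup is again made up for the example.

```python
from bs4 import BeautifulSoup


def li_with_prefix_under(prefix, ancestor_name, ancestor_class):
    """Hypothetical helper: build a predicate for find() / find_all()."""
    def predicate(tag):
        if tag.name != "li":
            return False
        text = tag.string  # None when the tag has nested children
        if text is None or not text.startswith(prefix):
            return False
        # find_parent() walks up the tree, so any matching ancestor counts.
        return tag.find_parent(ancestor_name, class_=ancestor_class) is not None
    return predicate


# Assumed sample markup: only the second list sits under div.info.
html = """
<ul><li>Rating: 1</li></ul>
<div class="info"><ul><li>Rating: 5</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

match = soup.find(li_with_prefix_under("Rating: ", "div", "info"))
```

The call site stays a one-liner, and the ancestor condition lives in one place where it can be tested on its own.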

Additionally, you can do this with select and select_one without additional functions. The choice is yours.

I'm still not sure what you mean by "pure" API, but technically, both these are pure API, one just requires you to write your own function and pass it in.
