Find all tags containing a string in BeautifulSoup

Time:02-01

In BeautifulSoup, I can use find_all(string='example') to find all NavigableStrings that match against a string or regex.

Is there a way to do this using get_text() instead of string, so that the search matches a string even if it spans across multiple nodes? i.e. I'd want to do something like: find_all(get_text()='Python BeautifulSoup'), which would match against the entire inner string content.

For example, take this snippet:

<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>

If I wanted to find 'Python BeautifulSoup' and have it return both the body and div tags, how could I accomplish this?

CodePudding user response:

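You could use CSS selectors in combination with the pseudo-class :-soup-contains-own(), which matches elements whose own text (not the text of their descendants) contains the given string: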

soup.select_one(':-soup-contains-own("BeautifulSoup")')

or get only the text of the element:

soup.select_one(':-soup-contains-own("BeautifulSoup")').get_text(' ', strip=True)

Example

from bs4 import BeautifulSoup

html = '''
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

soup.select(':-soup-contains-own("BeautifulSoup")')

Output

[<div>
    Python
    <br/>
    BeautifulSoup
  </div>]
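
Note that this only returns the div, not the body, because :-soup-contains-own() looks at an element's own text nodes. As a rough sketch of the difference, the snippet below compares it with :-soup-contains(), which also considers descendant text. It assumes a reasonably recent Soup Sieve in which both pseudo-classes are available; the variable names `own` and `any_depth` are just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Elements whose *own* text nodes contain the string
own = [tag.name for tag in soup.select(':-soup-contains-own("BeautifulSoup")')]

# Elements whose text, including descendants, contains the string
any_depth = [tag.name for tag in soup.select(':-soup-contains("BeautifulSoup")')]

print(own)
print(any_depth)
```

As far as I can tell, both selectors compare against the raw text without collapsing whitespace, so a phrase such as 'Python BeautifulSoup' that is split by a line break and a <br> in the markup would likely not match when written with a single space.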

CodePudding user response:

You can pass a lambda function to .find_all():

from bs4 import BeautifulSoup

html_doc = '''\
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for tag in soup.find_all(lambda tag: 'Python BeautifulSoup' in tag.get_text(strip=True, separator=' ')):
    print(tag.name)

Prints:

body
div
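
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: `find_tags_containing` is a made-up name, not part of BeautifulSoup, and it assumes you want whitespace collapsed on both the search phrase and the tag text before comparing.

```python
from bs4 import BeautifulSoup


def find_tags_containing(soup, phrase):
    """Return every tag whose whitespace-normalized text contains `phrase`.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    # Collapse runs of whitespace in the phrase to single spaces
    wanted = " ".join(phrase.split())

    def matches(tag):
        # Join all descendant text nodes, then collapse whitespace the same way
        text = " ".join(tag.get_text(separator=" ").split())
        return wanted in text

    return soup.find_all(matches)


html_doc = """<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
  <p>Unrelated text</p>
</body>"""

soup = BeautifulSoup(html_doc, "html.parser")
names = [tag.name for tag in find_tags_containing(soup, "Python   BeautifulSoup")]
print(names)
```

Because every ancestor of a matching tag also contains the text, this returns the whole chain of enclosing tags (here body and div), which is what the question asks for.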