In BeautifulSoup, I can use find_all(string='example') to find all NavigableStrings that match against a string or regex. Is there a way to do this using get_text() instead of string, so that the search matches a string even if it spans across multiple nodes? i.e. I'd want to do something like: find_all(get_text()='Python BeautifulSoup'), which would match against the entire inner string content.
For example, take this snippet:
<body>
<div>
Python
<br>
BeautifulSoup
</div>
</body>
If I wanted to find 'Python BeautifulSoup' and have it return both the body and div tags, how could I accomplish this?
CodePudding user response:
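You could use CSS selectors in combination with the pseudo-class :-soup-contains-own(), which matches elements whose own text nodes (not those of their descendants) contain the given string: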
soup.select_one(':-soup-contains-own("BeautifulSoup")')
or get only the text of the matched element:
soup.select_one(':-soup-contains-own("BeautifulSoup")').get_text(' ', strip=True)
Example
from bs4 import BeautifulSoup
html = '''
<body>
<div>
Python
<br>
BeautifulSoup
</div>
</body>
'''
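# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.select(':-soup-contains-own("BeautifulSoup")'))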
Output
[<div>
Python
<br/>
BeautifulSoup
</div>]
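As a rough sketch of how the two pseudo-classes differ on this snippet (assuming a soupsieve version recent enough to support the :-soup-contains spellings, and using a compacted copy of the question's HTML): :-soup-contains-own() looks only at an element's own text nodes, :-soup-contains() also looks at descendant text, and neither inserts a separator between nodes, so a phrase that relies on a space across the <br> should not match.

```python
from bs4 import BeautifulSoup

# Compacted version of the snippet from the question (an assumption for brevity)
html = '<body><div>Python<br>BeautifulSoup</div></body>'
soup = BeautifulSoup(html, 'html.parser')

# Elements whose *own* text nodes contain the word
own = [t.name for t in soup.select(':-soup-contains-own("BeautifulSoup")')]

# Elements whose text, including descendants, contains the word
contains_any = [t.name for t in soup.select(':-soup-contains("BeautifulSoup")')]

# Text nodes are joined without a separator, so the spaced phrase is not found
spanning = soup.select(':-soup-contains("Python BeautifulSoup")')

print(own)           # ['div']
print(contains_any)  # ['body', 'div']
print(spanning)      # []
```

So the selector approach finds the div by a single word, but it does not by itself cover the "spans multiple nodes" case from the question.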
CodePudding user response:
You can use a lambda function in .find_all():
from bs4 import BeautifulSoup
html_doc = '''\
<body>
<div>
Python
<br>
BeautifulSoup
</div>
</body>'''
soup = BeautifulSoup(html_doc, 'html.parser')
# get_text() joins all descendant text nodes, so the match can span the <br>
for tag in soup.find_all(lambda tag: 'Python BeautifulSoup' in tag.get_text(strip=True, separator=' ')):
    print(tag.name)
Prints:
body
div
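If this is needed in more than one place, the lambda can be wrapped in a small helper. The following is only a sketch: find_by_text and its exact parameter are hypothetical names, not part of BeautifulSoup, and whitespace is collapsed with str.split() so that line breaks and indentation in the markup do not affect the comparison.

```python
from bs4 import BeautifulSoup


def find_by_text(soup, needle, exact=False):
    """Return tags whose combined text contains (or equals) needle.

    Hypothetical helper, not a BeautifulSoup API.
    """
    def matches(tag):
        # Join all descendant text nodes, then collapse runs of whitespace
        text = ' '.join(tag.get_text(' ').split())
        return text == needle if exact else needle in text

    return soup.find_all(matches)


html_doc = '''\
<body>
<div>
Python
<br>
BeautifulSoup
</div>
</body>'''

soup = BeautifulSoup(html_doc, 'html.parser')
print([tag.name for tag in find_by_text(soup, 'Python BeautifulSoup')])
# ['body', 'div']
```

Every ancestor of the matching text is returned as well, which is what the question asks for; to keep only the innermost tag, the result could be filtered further.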