Suppose I have a list of keywords, e.g. ["apple", "dog", "cat"], and some HTML. I'd like to return all the elements that contain one of these keywords as an immediate descendant. How would I go about doing this?
I've tried using soup.find_all(text=keywords), but that yields nothing.
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
CodePudding user response:
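BeautifulSoup supports regular expressions for text searching, so you can use one with the IGNORECASE flag (since your keyword is dog, and your element contains Dogs). Passing the list directly only matches strings that are exactly equal to one of the keywords, which is why your attempt returned nothing.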
import re
from bs4 import BeautifulSoup
source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]
print(soup.find_all(text=re.compile("|".join(keywords), flags=re.IGNORECASE)))
Output:
['I like apples', "Dogs are cool. I don't match this either.", 'I have a cat.']
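The call above returns the matching strings, while the question asks for the elements. One possible follow-up is to take the .parent of each matched string. This is a minimal sketch, assuming the same source as above; matching_elements is a hypothetical helper name, re.escape is added as a precaution in case a keyword contains regex metacharacters, and string= is used as the newer name for the text= argument in recent Beautiful Soup versions.

```python
import re
from bs4 import BeautifulSoup

source = """
<html>
<p>I like apples</p>
<p>I don't want to match this</p>
<div>Dogs are cool. I don't match this either.</div>
<div>I have a cat.</div>
</html>
"""

def matching_elements(html, keywords):
    """Return the tags whose direct text contains any keyword (case-insensitive).

    Hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Escape each keyword so characters like "." or "+" are matched literally.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)
    # find_all(string=...) yields the matching text nodes; .parent is the enclosing tag.
    return [text_node.parent for text_node in soup.find_all(string=pattern)]

elements = matching_elements(source, ["apple", "dog", "cat"])
print([el.name for el in elements])
```

Note that a tag with several matching text nodes would appear once per match, so you may want to de-duplicate the result. The pattern also matches substrings, so "cat" would match "concatenate" as well.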
As a note, you say "immediate descendant", and the element with Dogs also contains "I don't match this either". Because both sentences sit in the same text node of that <div>, the whole string is returned. If this line were something like <div>Dogs are cool. <div>I don't match this either.</div></div>, then the output would be
['I like apples', 'Dogs are cool. ', 'I have a cat.']
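The nested case can be checked in isolation. This is a small sketch, assuming only the nested <div> from the note above and a single keyword; it also shows that the parent's get_text() still includes the text of nested children, so use the matched string itself when you only want the immediate text.

```python
import re
from bs4 import BeautifulSoup

nested = "<div>Dogs are cool. <div>I don't match this either.</div></div>"
soup = BeautifulSoup(nested, "html.parser")
pattern = re.compile("dog", flags=re.IGNORECASE)

# Only the outer div's own text node matches; the inner div's text does not.
matches = soup.find_all(string=pattern)
print(matches)

# The enclosing tag's get_text() concatenates the text of all its descendants.
print(matches[0].parent.get_text())
```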