Python Beautiful Soup search for elements that contain text from list

Time:09-29

Suppose I have a list of keywords, e.g. ["apple", "dog", "cat"], and some HTML. I'd like to return all the elements that contain one of these keywords in their own text, i.e. in a text node that is an immediate child of the element. How would I go about doing this?

I've tried using soup.find_all(text=keywords), but that yields nothing.

from bs4 import BeautifulSoup

source = """
<html>
    <p>I like apples</p>
    <p>I don't want to match this</p>
    <div>Dogs are cool. I don't match this either.</div>
    <div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

CodePudding user response:

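Passing the list directly only matches text nodes that are exactly equal to one of the keywords, which is why it returns nothing. BeautifulSoup also supports regex for text searching, so you can join the keywords into one pattern and compile it with the IGNORECASE flag (needed because your keyword is "dog" while your element contains "Dogs"):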

import re
from bs4 import BeautifulSoup

source = """
<html>
    <p>I like apples</p>
    <p>I don't want to match this</p>
    <div>Dogs are cool. I don't match this either.</div>
    <div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

print(soup.find_all(text=re.compile("|".join(keywords), flags=re.IGNORECASE)))


Output:

['I like apples', "Dogs are cool. I don't match this either.", 'I have a cat.']
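
Note that the call above returns the matching text nodes, not the elements. Since the question asks for elements, here is a minimal sketch of one way to get them: take the .parent of each matching string. It uses string=, the newer name for the text= argument in recent bs4 versions, and escapes each keyword with re.escape in case a keyword ever contains regex metacharacters. The names pattern and matching_elements are only illustrative.

```python
import re
from bs4 import BeautifulSoup

source = """
<html>
    <p>I like apples</p>
    <p>I don't want to match this</p>
    <div>Dogs are cool. I don't match this either.</div>
    <div>I have a cat.</div>
</html>
"""
soup = BeautifulSoup(source, "html.parser")
keywords = ["apple", "dog", "cat"]

# Build one case-insensitive alternation; re.escape keeps keywords literal.
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

# find_all(string=...) yields text nodes; .parent is the element that
# directly contains each one.
matching_elements = [s.parent for s in soup.find_all(string=pattern)]

for element in matching_elements:
    print(element.name, "->", element.get_text())
```

If one element holds several matching text nodes, it will appear more than once in this list, so deduplicate if that matters for your use case.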

As a note, you say "immediate descendant", yet the result for the Dogs element includes "I don't match this either". That is because, in your HTML, the whole sentence is a single text node directly inside the <div>, so it is returned in full. If that line were instead something like <div>Dogs are cool. <div>I don't match this either.</div></div>, then the output would be

['I like apples', 'Dogs are cool. ', 'I have a cat.']
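
To check the nested case yourself, a small sketch along these lines should do. The variable names (nested, hits, outer, direct_text) are made up for illustration, and only the "dog" keyword is used to keep it short.

```python
import re
from bs4 import BeautifulSoup

# The nested variant: the second sentence lives in an inner <div>.
nested = "<div>Dogs are cool. <div>I don't match this either.</div></div>"
nested_soup = BeautifulSoup(nested, "html.parser")

pattern = re.compile("dog", flags=re.IGNORECASE)

# Only the outer div's own text node matches the pattern.
hits = nested_soup.find_all(string=pattern)
print(hits)

# The element that directly contains the matching text.
outer = hits[0].parent

# recursive=False restricts the search to the element's direct children,
# so text inside the inner <div> is left out.
direct_text = "".join(outer.find_all(string=True, recursive=False))
print(repr(direct_text))
```

The point of direct_text is that outer.get_text() would still include the inner sentence, whereas the non-recursive search keeps only the text that is an immediate child of the element.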
