Does the Python beautifulsoup.find_all(text=) have a problem with Unicode characters?

Time:05-27

I'm using BeautifulSoup to try to locate a P tag in an XML parse tree based on its contents:

# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup

# Determine today's date.
today = date.today()

# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"

# Initialize a requests Response object.
page = requests.get(url)

# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")

# Remove tags with irregular text.
for i in soup.find_all("P", text="(See § 125.4 of this subchapter for exemptions.) "):
    print(i)
    i.decompose()

When I run this code, I get a NoneType object (None is printed to the console), even though I can see in the XML file that the element exists (including the trailing nbsp). Does Beautiful Soup have a problem with Unicode, or am I missing something else?

Thanks!

CodePudding user response:

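The main issue is that text="(See § 125.4 of this subchapter for exemptions.) " looks for an exact match against the tag's .string, but it won't find one, because in your XML the paragraph looks like (<I>See</I> § 125.4 of this subchapter for exemptions.) . Since the P tag contains a nested I tag as well as text, its .string is None, so this is not a Unicode problem.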
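To see why the exact match fails, here is a minimal offline sketch. The markup is a hypothetical snippet shaped like the paragraph above, not the real eCFR response, and it uses html.parser (which lowercases tag names) only to avoid depending on lxml; the .string behavior should be the same with features="xml".

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the paragraph in question (not the real eCFR data).
# html.parser lowercases tag names, so <P> is addressed as "p" here.
markup = (
    "<div>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "<p>Plain text only</p>"
    "</div>"
)
soup = BeautifulSoup(markup, "html.parser")

nested, plain = soup.find_all("p")

# .string is None when a tag has more than one child node (here: "(", <i>, and text).
nested_string = nested.string
# .string is the text itself when the tag holds a single string.
plain_string = plain.string

# A string filter combined with a tag name is compared against .string,
# so the paragraph with the nested <i> tag is never matched.
matches = soup.find_all(
    "p", string="(See § 125.4 of this subchapter for exemptions.)\u00a0"
)

# get_text() joins the text of all descendants, nested tags included.
full_text = nested.get_text()

print(nested_string, plain_string, matches, repr(full_text))
```

The string= argument is the current name for text= in Beautiful Soup 4; both are matched against .string in the same way.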

You can work around this with CSS selectors and the :-soup-contains() pseudo-class, which matches on the tag's full text, including text inside nested tags:

for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
Example
from datetime import date
import requests
from bs4 import BeautifulSoup

# Determine today's date.
today = date.today()

# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"

# Initialize a requests Response object.
page = requests.get(url)

# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")

# Remove tags with irregular text.
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
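If you want to check the approach without hitting the API, the following sketch runs against a small hypothetical document (the markup and the has_needle helper are assumptions for illustration, and html.parser with lowercase tag names stands in for the XML parser). It also shows one possible alternative to the selector: passing a function to find_all() that tests get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the eCFR response; not the real document.
markup = (
    "<div>"
    "<p>Some other paragraph.</p>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "</div>"
)
soup = BeautifulSoup(markup, "html.parser")

needle = "See § 125.4 of this subchapter for exemptions."

# The selector matches on the combined text of the tag and its descendants.
selected = soup.select(f'p:-soup-contains("{needle}")')
selected_count = len(selected)


def has_needle(tag):
    """Hypothetical helper: True for <p> tags whose full text contains needle."""
    return tag.name == "p" and needle in tag.get_text()


# find_all() accepts a function as a filter; it returns a list,
# so removing tags while looping over the result is safe.
for tag in soup.find_all(has_needle):
    tag.decompose()

remaining = [p.get_text() for p in soup.find_all("p")]
print(selected_count, remaining)
```

Both variants do a substring match on the text, so the trailing nbsp does not need to be part of the search string.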