I'm using BeautifulSoup to try to locate a P tag in an XML parse tree based on its contents:
# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup
# Determine today's date.
today = date.today()
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove tags with irregular text.
for i in soup.find_all("P", text="(See § 125.4 of this subchapter for exemptions.) "):
    print(i)
    i.decompose()
When I run this code, I get a NoneType object (it prints None to the console), even though I can see the element in the XML file (including the trailing nbsp). Does Beautiful Soup have a problem with Unicode, or am I missing something else?
Thanks!
Answer:
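The main issue is that text="(See § 125.4 of this subchapter for exemptions.) " looks for an exact match but won't find one, because in your XML the paragraph actually looks like (<I>See</I> § 125.4 of this subchapter for exemptions.). The nested <I> tag means the P element has more than one child, so its .string is None and the filter has nothing to compare against.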
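To make the mismatch concrete, here is a minimal offline sketch. The markup is a hypothetical stand-in for the eCFR paragraph, not the real response, and it uses the built-in html.parser only so that it runs without lxml (an assumption made for convenience; that parser lowercases tag names, hence "p" rather than "P").

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the eCFR paragraph, including the trailing nbsp.
markup = "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
soup = BeautifulSoup(markup, "html.parser")
p = soup.find("p")

# The tag has several children (text, <i>, text), so .string is None.
string_is_none = p.string is None

# An exact string filter is compared against .string, so nothing matches.
exact_matches = soup.find_all(
    "p", string="(See § 125.4 of this subchapter for exemptions.)\u00a0"
)

# get_text() joins the text of all descendants, which is the sentence
# you see when reading the file.
full_text = p.get_text()

print(string_is_none, len(exact_matches), repr(full_text))
```

So the trailing nbsp and the § character are not the culprits here; the filter fails before any character comparison happens.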
You could fix that by using a CSS selector with the :-soup-contains() pseudo-class, which matches on the element's text, including text inside child tags:
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
Example
from datetime import date
import requests
from bs4 import BeautifulSoup
# Determine today's date.
today = date.today()
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove tags with irregular text.
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
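If you would rather not depend on a CSS selector, a function filter that tests get_text() should behave the same way. The sketch below compares both approaches on hypothetical inline markup rather than the live eCFR response; the helper name mentions_target is made up for illustration, and html.parser is again used only to avoid requiring lxml (so tag names are lowercase).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph to remove, one to keep.
markup = (
    "<div>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "<p>Some other paragraph.</p>"
    "</div>"
)
soup = BeautifulSoup(markup, "html.parser")
target = "See § 125.4 of this subchapter for exemptions."

# Option 1: the pseudo-class does a substring match on the tag's full text.
by_selector = soup.select(f'p:-soup-contains("{target}")')

# Option 2: a function filter (hypothetical helper) that tests get_text().
def mentions_target(tag):
    return tag.name == "p" and target in tag.get_text()

by_function = soup.find_all(mentions_target)

# Both approaches find the same single tag; remove it from the tree.
for tag in by_function:
    tag.decompose()

remaining = [p.get_text() for p in soup.find_all("p")]
print(remaining)
```

Because both options do a substring match, the leading parenthesis and the trailing nbsp do not need to be part of the search text.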