I have a web scraper that uses BeautifulSoup to pull articles from CNN, FOX, and BBC. After some preprocessing, I return the raw articles to an API. However, I cannot figure out how to completely remove HTML tags that have a particular unwanted class in Python. I tried the lxml Cleaner, and I can remove tags with it, but I cannot restrict the removal to only the tags that have a certain class.
For example, if I am trying to remove the class "help", I would like a script that takes HTML like the following and drops the first paragraph entirely, keeping only the second:
<p class="help">Here are some tips which are useful</p>
<p> Welcome to webscraping 101 </p>
You can use BeautifulSoup's .decompose() method (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose). According to the documentation, it "removes a tag from the tree, then completely destroys it and its contents":
# soup is the BeautifulSoup object parsed from the page
for tag in soup.find_all("p", class_="help"):
    tag.decompose()  # removes the tag and everything inside it
print(soup.prettify())
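
For completeness, here is a minimal self-contained sketch of the same idea applied to the example from the question. The helper name strip_class and the choice of the built-in "html.parser" are my own assumptions, not something from the question. Unlike the loop above, it omits the tag name, so it removes any element with the class, not only <p> tags.

```python
from bs4 import BeautifulSoup


def strip_class(html: str, class_name: str) -> str:
    """Return html with every element that has class_name removed.

    strip_class is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any element whose class list contains class_name,
    # so <div class="help big"> would be removed as well.
    for tag in soup.find_all(class_=class_name):
        tag.decompose()
    return str(soup)


html = (
    '<p class="help">Here are some tips which are useful</p>'
    "<p> Welcome to webscraping 101 </p>"
)
cleaned = strip_class(html, "help")
print(cleaned)
```

With the sample input this should print only the second paragraph. If you only want to target <p> tags, pass "p" as the first argument to find_all, as in the loop above.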