I want to remove the p tag that has a class from the HTML below, without affecting the other p tags that follow it.
<article data-article-content>
<div class="abc">
<p class="ghf">Some Text</p>
<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>
What I am using:
import re

def myFunction(result):
    result = re.sub(r'<article data-article-content><div ><p >.*?</p><\/article>',' ',result)
    return result
When I call this function and print the result, 'Some Text' should be omitted. I am a beginner with regex, so any suggestions would help.
CodePudding user response:
Use BeautifulSoup. It's a fantastic HTML parser and it has a very intuitive API. I've used it hundreds of times for large and small projects.
html = '''
<article data-article-content>
<div class="abc">
<p class="ghf">Some Text</p>
<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>
'''
from bs4 import BeautifulSoup
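soup = BeautifulSoup(html, 'lxml')  # naming the parser avoids a warning; lxml adds the <html>/<body> wrapper seen in the output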
ps = soup.find_all('p')
p_with_class = [p for p in ps if p.get('class') is not None][0]
print(p_with_class)
# <p class="ghf">Some Text</p>
# Remove it.
p_with_class.decompose()
print(soup.prettify())
Output:
<html>
 <body>
  <article data-article-content="">
   <div class="abc">
    <p>
     Some other Text
    </p>
    <p>
     A different Text
    </p>
   </div>
  </article>
 </body>
</html>
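The steps above can be wrapped in a function shaped like the asker's myFunction. This is only a minimal sketch, not the answerer's code: the name remove_classed_paragraphs is hypothetical, it assumes the built-in html.parser is acceptable (so no <html>/<body> wrapper is added), and it removes every <p> that carries a class attribute rather than only the first one.

```python
from bs4 import BeautifulSoup


def remove_classed_paragraphs(html):
    """Return html with every <p> that has a class attribute removed.

    Hypothetical helper for illustration only.
    """
    # html.parser ships with Python and does not wrap fragments in <html>/<body>.
    soup = BeautifulSoup(html, "html.parser")
    # class_=True matches any tag that has a class attribute, whatever its value.
    for p in soup.find_all("p", class_=True):
        p.decompose()
    return str(soup)


html = '''
<article data-article-content>
<div class="abc">
<p class="ghf">Some Text</p>
<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>
'''

result = remove_classed_paragraphs(html)
print(result)
```

Returning a string keeps the call site the same as with the regex version, so the function can be swapped in without touching the rest of the script.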
CodePudding user response:
With BeautifulSoup you could transform the given HTML so that
- any <p> tag with class ghf is removed
Input
<article data-article-content>
<div class="abc">
<p class="ghf">Some Text</p>
<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>
Expected Output
<article data-article-content="">
<div class="abc">
<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>
With BeautifulSoup
This uses BeautifulSoup version 4, also known by the acronym bs4.
Install using pip:
pip install beautifulsoup4
Then parse, find, modify and print:
from bs4 import BeautifulSoup
html = '''
<article data-article-content>
<div class="abc">
<p class="ghf">Some Text</p>
<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>
'''
soup = BeautifulSoup(html, features='html.parser')  # parses HTML using Python's built-in HTML parser
found_paragraph = soup.find("p", {"class": "ghf"})  # first <p> with class "ghf", or None if there is none
found_paragraph.extract()  # removes the tag and leaves an empty line
print(soup)  # unfortunately indentation is lost
You could use prettify() on soup to get some indentation back.
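One caveat: find() returns only the first match, and None when nothing matches, in which case calling extract() on the result raises an AttributeError. A hedged sketch of a variant that removes every match and is safe when there are none; the second class="ghf" paragraph and the variable name removed are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample input with two matching paragraphs (the second one is invented for this example).
html = '''
<div class="abc">
<p class="ghf">Some Text</p>
<p>Some other Text</p>
<p class="ghf">More Text</p>
</div>
'''

soup = BeautifulSoup(html, features="html.parser")

# find_all() returns a (possibly empty) list, so the loop simply does nothing without matches.
removed = []
for p in soup.find_all("p", class_="ghf"):
    removed.append(p.extract())  # extract() detaches the tag and returns it

print(len(removed))
print(soup.prettify())
```

extract() hands back the removed tag, which is useful if you want to log or reuse it; decompose() would destroy it instead.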
See also
The functions used are explained in detail in related questions & answers: