How can I specifically remove a tag with a class using re.sub

Time:10-26

I want to remove the p tag that has a class from the HTML below, without affecting the other p tags that follow it.

<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>

What I am using:

import re

def myFunction(result):
    result = re.sub(r'<article data-article-content><div class="abc"><p class="ghf">.*?</p><\/article>',' ',result)
    return result

When I call this function and print the result, 'Some Text' should be omitted. I am a beginner with regex, so any suggestions would help.

CodePudding user response:

Use BeautifulSoup. It's a fantastic HTML parser and it has a very intuitive API. I've used it hundreds of times for large and small projects.

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

from bs4 import BeautifulSoup

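# Name the parser explicitly to avoid a "no parser specified" warning.
# lxml (pip install lxml) adds the <html>/<body> wrapper seen in the output.
soup = BeautifulSoup(html, 'lxml')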

ps = soup.find_all('p')

p_with_class = [p for p in ps if p.get('class') is not None][0]

print(p_with_class)
# <p class="ghf">Some Text</p>

# Remove it.
p_with_class.decompose()

print(soup.prettify())

Output:

<html>
 <body>
  <article data-article-content="">
   <div class="abc">
    <p>
     Some other Text
    </p>
    <p>
     A different Text
    </p>
   </div>
  </article>
 </body>
</html>
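
The list comprehension above picks only the first p tag that has a class. If every class-bearing p tag should go, a CSS selector can do the filtering. This is a minimal sketch, not part of the original answer: the function name strip_classed_paragraphs is made up for illustration, and it assumes html.parser (which ships with Python) so the fragment is not wrapped in html and body tags.

```python
from bs4 import BeautifulSoup


def strip_classed_paragraphs(html: str) -> str:
    """Return html with every <p> that carries a class attribute removed."""
    # html.parser is built in and leaves the fragment unwrapped.
    soup = BeautifulSoup(html, "html.parser")
    # The selector p[class] matches <p> tags that have any class attribute.
    for p in soup.select("p[class]"):
        p.decompose()
    return str(soup)


html = """
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
"""

cleaned = strip_classed_paragraphs(html)
print(cleaned)
```

Because the function takes a string and returns a string, it can stand in for myFunction from the question.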


CodePudding user response:

With BeautifulSoup you could transform the given HTML so that

  • any <p> tag with class ghf is removed

Input

<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>

Expected Output

<article data-article-content="">
<div class="abc">

<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>

With BeautifulSoup

This uses BeautifulSoup version 4, also known by the abbreviation bs4.

Install using pip:

pip install beautifulsoup4

Then parse, find, modify and print:

from bs4 import BeautifulSoup

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

soup = BeautifulSoup(html, features='html.parser') # parses HTML using Python's built-in HTML parser

found_paragraph = soup.find("p", {"class": "ghf"}) # first <p> with class "ghf", or None if there is none
found_paragraph.extract() # detaches it from the tree; the whitespace around it stays, leaving an empty line

print(soup) # unfortunately indentation is lost

You could use prettify() on soup to get some indentation back.
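
If re.sub really is a requirement, as in the question, something like the following could work for simple, well-formed input. It is only a sketch under stated assumptions: the function name remove_p_with_class is hypothetical, the class attribute is assumed to be double-quoted, and the pattern can misfire on hyphenated class names such as ghf-x or on unusual markup, which is why a parser remains the safer choice.

```python
import re


def remove_p_with_class(html: str, class_name: str) -> str:
    """Remove each <p ... class="... class_name ...">...</p> element from html."""
    pattern = re.compile(
        r'\s*'                       # leading whitespace, so no empty line is left behind
        r'<p\b[^>]*\bclass="[^"]*\b'  # opening <p ...> tag up to the class value
        + re.escape(class_name)
        + r'\b[^"]*"[^>]*>'           # rest of the class value and of the tag
        r'.*?</p>',                   # shortest possible content, then the closing tag
        flags=re.DOTALL,             # let . match newlines inside the paragraph
    )
    return pattern.sub("", html)


html = """
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
"""

result = remove_p_with_class(html, "ghf")
print(result)
```

Unlike the pattern in the question, this one targets only the p element itself rather than trying to match the whole article, so the surrounding tags and the other paragraphs are left untouched.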
