Get all the tags inside HTML as a skeleton in Python

Time:12-08

I have HTML like this:

<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>        
<a href="/news-list">**News**</a>       
<a href="/events">**Events**</a>               
<a href="/chitin">*CHITIN*</a></h3>
</div>
</div>

I need output like this:

<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>        
<a href="/news-list"></a>       
<a href="/events"></a>               
<a href="/chitin"></a></h3>
</div>
</div>

I need only the HTML elements and the attributes inside each tag. The content between the tags is not required. This is just an example.

What I actually want to do is extract the body element from a website and convert it to a skeleton of the page that contains only the tags, without their content.

Is there any way to do this? Thanks.

CodePudding user response:

To get a skeleton, you could recursively drop all text elements:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>        
<a href="/news-list">**News**</a>       
<a href="/events">**Events**</a>               
<a href="/chitin">*CHITIN*</a></h3>
</div>
</div>"""

def remove_text(tag):
    # Iterate over a copy, because extract() modifies tag.contents
    for child in list(tag.contents):
        if isinstance(child, NavigableString):
            # Detach the text node from the tree
            child.extract()
        else:
            # Recurse into nested tags
            remove_text(child)
    return tag


soup = BeautifulSoup(html, "html.parser")
soup = remove_text(soup)

print(soup.prettify())

This would result in the HTML looking like:

<div class="block-inner clearfix">
 <div class="content clearfix">
  <h3>
   <a href="/">
    <img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/>
   </a>
   <a href="/news-list">
   </a>
   <a href="/events">
   </a>
   <a href="/chitin">
   </a>
  </h3>
 </div>
</div>
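For the full-page case mentioned in the question, the same idea can be limited to the body element. The following is only a minimal sketch: the function name body_skeleton and the sample page are made up for illustration, and it assumes the page source is already available as a string.

```python
from bs4 import BeautifulSoup


def body_skeleton(page_html):
    """Return the <body> of page_html as a string with every text node removed."""
    soup = BeautifulSoup(page_html, "html.parser")
    body = soup.body
    if body is None:
        # No <body> tag in the input: fall back to the whole document
        body = soup
    # string=True matches the text nodes anywhere below body
    for text_node in body.find_all(string=True):
        text_node.extract()
    return body.decode()


# Hypothetical page, used only for illustration
page = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"
print(body_skeleton(page))
```

The head element is left out of the result because only the body tag is serialized.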

CodePudding user response:

Taking your data as html: to remove the text from a tag, you can call the .clear() method on it, but there is a catch.

.clear() removes everything inside the tag, including nested tags. You do not want to lose a tag that sits inside another tag, so add a condition: if the tag contains an img (or any other tag you want to keep), skip it so that .clear() is not performed on it.

from bs4 import BeautifulSoup

html = """<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>        
<a href="/news-list">**News**</a>       
<a href="/events">**Events**</a>               
<a href="/chitin">*CHITIN*</a></h3>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
# Empty every <a> inside the <h3>, but skip links that wrap an <img>
for a in soup.find("h3").find_all("a"):
    if a.find("img") is None:
        a.clear()
print(soup)

Output:

<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>
<a href="/news-list"></a>
<a href="/events"></a>
<a href="/chitin"></a></h3>
</div>
</div>
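The .clear() approach above only touches the a tags inside the h3. If BeautifulSoup is not available, a skeleton of a whole document can also be sketched with the standard library's html.parser module instead. This is a different technique from the answers above, not a drop-in replacement, and the names SkeletonBuilder and skeleton are made up for illustration.

```python
from html.parser import HTMLParser


class SkeletonBuilder(HTMLParser):
    """Collect start and end tags as they are parsed; text is never collected."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the original markup of the tag just parsed
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img .../> are emitted once, with no end tag
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data and handle_comment are not overridden, so text and comments are dropped


def skeleton(html):
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.parts)


print(skeleton('<h3><a href="/news-list">News</a> <a href="/events">Events</a></h3>'))
```

Unlike BeautifulSoup, this does not repair malformed markup: tags are re-emitted in the order they appear in the source.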