I have HTML like this:
<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>
<a href="/news-list">**News**</a>
<a href="/events">**Events**</a>
<a href="/chitin">*CHITIN*</a></h3>
</div>
</div>
I need output like this:
<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>
<a href="/news-list"></a>
<a href="/events"></a>
<a href="/chitin"></a></h3>
</div>
</div>
I need only the HTML elements and the attributes inside each tag. The content between the tags is not required. This is just an example.
What I actually want to do is extract the body element from a website and convert it to a skeleton that contains only the tags (without content).
Is there any way to do this? Thanks
CodePudding user response:
To get a skeleton, you could recursively drop all text elements:
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>
<a href="/news-list">**News**</a>
<a href="/events">**Events**</a>
<a href="/chitin">*CHITIN*</a></h3>
</div>
</div>"""

def remove_text(soup):
    # Iterate over a copy, because extract() modifies soup.contents
    for element in list(soup.contents):
        if isinstance(element, NavigableString):
            # Text node (comments included): detach it from the tree
            element.extract()
        else:
            # Tag: recurse into its children
            remove_text(element)
    return soup

soup = BeautifulSoup(html, "html.parser")
soup = remove_text(soup)
print(soup.prettify())
This would result in the HTML looking like:
<div class="block-inner clearfix">
<div class="content clearfix">
<h3>
<a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/>
</a>
<a href="/news-list">
</a>
<a href="/events">
</a>
<a href="/chitin">
</a>
</h3>
</div>
</div>
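For the wider goal in the question (a skeleton of a whole page body), a minimal sketch along the same lines could look like the following. It assumes the page source is already available as a string; the helper name body_skeleton and the sample page are made up for illustration. Instead of recursing, it collects every text node under body with find_all(string=True) and detaches each one with extract().

```python
from bs4 import BeautifulSoup


def body_skeleton(page_source):
    """Return the <body> of page_source with every text node removed.

    body_skeleton is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    body = soup.body
    if body is None:
        # No <body> tag found (e.g. an HTML fragment): use the whole tree.
        body = soup
    # find_all(string=True) returns all text nodes (comments included);
    # extract() detaches each one from the tree.
    for text_node in body.find_all(string=True):
        text_node.extract()
    return body


# Made-up sample page, standing in for a downloaded document.
page = """<html><head><title>Demo</title></head>
<body><div class="content clearfix"><h3><a href="/news-list">News</a></h3>
<p>Some <b>bold</b> text</p></div></body></html>"""

skeleton = body_skeleton(page)
print(skeleton)
```

Whitespace between tags is also a text node, so the result comes out on a single line; calling prettify() on the returned tag would re-indent it.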
CodePudding user response:
I have taken your data as html. To remove the text from a tag, you can use the .clear() method on it. There is a catch, though: you don't want to remove a tag nested inside an a tag, so add a condition that skips the .clear() call when the a tag contains an img (or any other HTML tag).
from bs4 import BeautifulSoup

html = """<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>
<a href="/news-list">**News**</a>
<a href="/events">**Events**</a>
<a href="/chitin">*CHITIN*</a></h3>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
main_data = soup.find("h3").find_all("a")
for i in main_data:
    # Skip links that wrap an img tag, so the image is kept
    if i.find("img") is not None:
        continue
    # Remove everything inside the link, leaving an empty <a></a>
    i.clear()
print(soup)
Output:
<div class="block-inner clearfix">
<div class="content clearfix">
<h3><a href="/">
<img alt="" src="/sites/default/files/house-png-202.png" style="width: 35px; float: left; height: 35px;"/></a>
<a href="/news-list"></a>
<a href="/events"></a>
<a href="/chitin"></a></h3>
</div>
</div>
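The same .clear() idea can be applied beyond the h3 links. The sketch below is one possible generalization, not the method from the answer above: it empties only tags that have no child tags, which extends the "skip it if it contains an img" condition to any nested tag. The shortened sample markup is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, for illustration only.
html = """<h3><a href="/"><img alt="" src="/sites/default/files/house-png-202.png"/></a>
<a href="/news-list">News</a>
<a href="/events">Events</a></h3>"""

soup = BeautifulSoup(html, "html.parser")
# find_all(True) matches every tag in the document.
for tag in soup.find_all(True):
    # tag.find(True) is None when the tag has no child tags, so only
    # leaf tags are emptied and nested tags such as img survive.
    if tag.find(True) is None:
        tag.clear()
print(soup)
```

A limitation of this variant: text that sits directly inside a tag which also has child tags (mixed content) is left in place, so it only yields a clean skeleton when the text lives in leaf tags.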