How can I download the plain HTML of a website, stripped of all classes, ids, attributes, etc?

Time:12-02

For example, for a page with source code

<h1 id='page-title'>This is a page</h1>
<div class='page-part'>
    <button id='red-button' style='background-color:Red'>I'm a button</button>
    <button id='blue-button' style='background-color:Blue'>I'm a button</button>
</div>

I want to get

<h1>This is a page</h1>
<div>
    <button>I'm a button</button>
    <button>I'm a button</button>
</div>

How can I do this?

CodePudding user response:

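You can do this in Python with ElementTree from the standard library. ElementTree is an XML parser, so this only works if the markup is well-formed (a single root element, every tag closed):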

import xml.etree.ElementTree as ET

#I changed your html a bit, to make sure the code works
mu = """
<html>
<h1 id='page-title'>This is a page</h1>
<div class='page-part'>
    <button id='red-button' style='background-color:Red'>I'm a button</button>
    <button id='blue-button' value = "yo" style='background-color:Blue'>I'm also a button</button>
</div>
<div>I have no attributes</div>
</html>
"""

doc1 = ET.fromstring(mu)

# iter() walks the root element and all of its descendants
# (findall('.//*') returns only the descendants and would skip the root)
for elem in doc1.iter():
    # drop every attribute of this element
    elem.attrib.clear()

print(ET.tostring(doc1).decode())

Output:

<html>
<h1>This is a page</h1>
<div>
    <button>I'm a button</button>
    <button>I'm also a button</button>
</div>
<div>I have no attributes</div>
</html>
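
Pages downloaded from real websites are often not well-formed XML (unclosed <br> tags, no single root element), and ET.fromstring raises a ParseError on them. As an alternative, here is a minimal sketch that uses html.parser from the standard library instead. The names AttributeStripper, strip_attributes and fetch are made up for this example and are not part of the answer above. The sketch simply re-emits each tag without its attributes and does not try to repair broken markup.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AttributeStripper(HTMLParser):
    """Re-emit the HTML that is fed in, dropping all attributes from every tag."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; exactly as written
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")        # attrs is ignored on purpose

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}/>")       # self-closing tags like <br/>

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")      # e.g. <!DOCTYPE html>


def strip_attributes(html_text):
    """Return html_text with every attribute removed from every tag."""
    parser = AttributeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


def fetch(url):
    """Download a page and decode it (hypothetical helper; falls back to UTF-8)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


sample = """<h1 id='page-title'>This is a page</h1>
<div class='page-part'>
    <button id='red-button' style='background-color:Red'>I'm a button</button>
    <button id='blue-button' style='background-color:Blue'>I'm a button</button>
</div>"""

print(strip_attributes(sample))
```

For a live page you would call strip_attributes(fetch(url)) with the address of the page. Note that html.parser lowercases tag names, and that the contents of script and style elements are passed through unchanged.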