Parsing invalid HTML and retrieving tag's text to replace it

Time:03-24

I need to iterate over invalid HTML and get the text of every tag so that I can change it.

from bs4 import BeautifulSoup

html_doc = """
<div  data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div  href="#"></div>
   <div >
    <h3  id="headline-213-142"><span  id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div  id="text_block-214-142"><span  id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")

print(soup)

The result is

<div  data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div  href="#"></div>
<div >
<h3  id="headline-213-142"><span  id="span-225-142">1</span></h3> </div>
</div><div  id="text_block-214-142"><span  id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>

I know how to replace the text, but BeautifulSoup doesn't find the text that sits directly inside the paragraph tag. The texts "Začátek sklizně:", "Otevřeno:" and ", denně" are never found, so I cannot replace them.

I tried different parsers such as lxml and html5lib, but that made no difference. I also tried Python's built-in HTML library, but it only supports iterating over HTML, not changing it.

CodePudding user response:

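Called on a tag, .string returns a NavigableString only when the tag has exactly one child: if that child is a string, it is returned; if it is another tag, that tag's .string is used. A tag with no children or with more than one child returns None. That is why the <p> element, which mixes text with <strong> and <br> tags, is skipped by your loop.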
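To make the .string behaviour concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration; only .string and .strings come from BeautifulSoup itself.

```python
from bs4 import BeautifulSoup

# Illustrative snippet (not the original document): a <p> with mixed
# content and an <h3> that wraps a single <span>.
soup = BeautifulSoup(
    "<p>Otevřeno: <strong>6 h</strong>, denně</p><h3><span>Sklizeň</span></h3>",
    "html.parser",
)

# <p> has three children (text, <strong>, text), so .string is None.
p_string = soup.p.string

# <strong> has exactly one text child, so .string returns it.
strong_string = soup.strong.string

# <h3> has exactly one child tag, so .string falls through to <span>.string.
h3_string = soup.h3.string

# .strings yields every text node below the tag, including nested ones.
p_texts = list(soup.p.strings)

print(p_string, strong_string, h3_string, p_texts)
```

So a loop that relies on tag.string only reaches text in tags whose content is a single string (directly or through a chain of single children); mixed content needs .strings or a search over the text nodes themselves.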

The scenario is not quite clear to me, but here is one last approach based on your comment:

"I need generic code to iterate any html and find all texts so I can work with them."

for tag in soup.find_all(text=True):
    tag.replace_with('1')

Example

from bs4 import BeautifulSoup

html_doc = """<div  data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div  href="#"></div>
   <div >
    <h3  id="headline-213-142"><span  id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div  id="text_block-214-142"><span  id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all(text=True) returns every text node, including whitespace-only ones
for tag in soup.find_all(text=True):
    tag.replace_with('1')

print(soup)

Output

<div  data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div  href="#"></div>1<div >1<h3  id="headline-213-142"><span  id="span-225-142">1</span></h3>1</div>1</div><div  id="text_block-214-142"><span  id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>