I need to iterate over invalid HTML and get the text value of every tag so that I can change it.
from bs4 import BeautifulSoup
html_doc = """
<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div href="#"></div>
<div >
<h3 id="headline-213-142"><span id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div id="text_block-214-142"><span id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")
print(soup)
The result is
<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div href="#"></div>
<div >
<h3 id="headline-213-142"><span id="span-225-142">1</span></h3> </div>
</div><div id="text_block-214-142"><span id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>
I know how to replace the text, but BeautifulSoup does not find the text that sits directly inside the paragraph tag. The strings "Začátek sklizně:", "Otevřeno:" and ", denně" are never returned, so I cannot replace them.
I tried different parsers such as lxml and html5lib, but they make no difference. I also tried Python's HTML library, but it only supports iterating over HTML, not changing it.
CodePudding user response:
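.string on a Tag object returns a NavigableString only when the tag has a single string child (or a single child tag that itself has a .string); the returned value is then that string. If the tag has no children or more than one child, it returns None. That is why the text directly inside the <p> tag is never found: the <p> holds several children (text, <strong>, <br>, more text), so its .string is None.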
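The behaviour of .string described above can be checked with a small sketch. The HTML snippet and the variable names here are made up for illustration and are not taken from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <p> with mixed content (text, a tag, more text).
soup = BeautifulSoup("<p>Start: <strong>Open</strong>, daily</p>", "html.parser")

# <strong> has exactly one child, a string, so .string returns it.
single = soup.strong.string

# <p> has three children (text, <strong>, text), so .string is None.
mixed = soup.p.string

# The text nodes are still there, as direct children of <p>.
p_children = [str(child) for child in soup.p.children]
```

So the text is not missing from the tree; it just cannot be reached through .string on the parent tag.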
The scenario is not quite clear to me, but here is one last approach based on your comment:
"I need generic code to iterate any html and find all texts so I can work with them."
for tag in soup.find_all(text=True):
    tag.replace_with('1')
Example
from bs4 import BeautifulSoup
html_doc = """<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div href="#"></div>
<div >
<h3 id="headline-213-142"><span id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div id="text_block-214-142"><span id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(text=True):
    tag.replace_with('1')
print(soup)
Output
<div data-oxy-toggle-active- data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div href="#"></div>1<div >1<h3 id="headline-213-142"><span id="span-225-142">1</span></h3>1</div>1</div><div id="text_block-214-142"><span id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>
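As the output shows, the whitespace-only strings between tags are replaced as well. If that is not wanted, a possible variation is to skip blank nodes and pass each remaining text through a function. This is only a sketch: rewrite_text_nodes, transform and the sample HTML are hypothetical names and data, and string=True is the newer spelling of text=True (Beautiful Soup 4.4.0 and later):

```python
from bs4 import BeautifulSoup


def rewrite_text_nodes(html, transform):
    """Apply transform() to every non-blank text node and return the new HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so replacing nodes while looping over it is safe.
    for node in soup.find_all(string=True):
        if node.strip():  # skip whitespace-only nodes between tags
            node.replace_with(transform(str(node)))
    return str(soup)


# Hypothetical input: upper-case every visible text, keep the layout whitespace.
sample = "<div>\n<p>Start: <strong>Open</strong>, daily</p>\n</div>"
result = rewrite_text_nodes(sample, str.upper)
```

Note that find_all(string=True) also returns comments and similar string-like nodes, so real pages may need an extra type check before replacing.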