I want to parse some HTML using BeautifulSoup and replace any line breaks (\n) that are within <blockquote> tags with <br> tags. It is extra difficult because the <blockquote> may contain other HTML tags.
My current attempt:
from bs4 import BeautifulSoup
html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""
soup = BeautifulSoup(html, "html.parser")
for element in soup.findAll():
    if element.name == "blockquote":
        new_content = BeautifulSoup(
            "<br>".join(element.get_text(strip=True).split("\n")).strip("<br>"),
            "html.parser",
        )
        element.string.replace_with(new_content)
print(str(soup))
Output should be:
<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>
However, this code, adapted from this answer, only works if there are no HTML tags within the <blockquote>. If there are (such as the <strong>Line 3</strong>), then element.string is None, and the above fails.
Is there an alternative that can cope with HTML tags?
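For reference, a minimal sketch of the failure described above. The variable names (plain, mixed and so on) are only illustrative. It compares .string on a <blockquote> that holds a single text node with one that also holds a <strong> tag:

```python
from bs4 import BeautifulSoup

plain = BeautifulSoup("<blockquote>Line 1\nLine 2</blockquote>", "html.parser")
mixed = BeautifulSoup(
    "<blockquote>Line 1\n<strong>Line 3</strong>\nLine 4</blockquote>",
    "html.parser",
)

# A tag whose only child is a text node exposes that node through .string.
plain_string = plain.blockquote.string

# With several children (text, <strong>, text), .string is None.
mixed_string = mixed.blockquote.string

# get_text() still returns the text, but the <strong> markup is gone.
mixed_text = mixed.blockquote.get_text()

print(repr(plain_string))
print(repr(mixed_string))
print(repr(mixed_text))
```

So calling element.string.replace_with(...) on the second blockquote raises an AttributeError, and rebuilding the content from get_text() would drop the <strong> tag anyway.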
CodePudding user response:
An alternative would be to use descendants to look for NavigableStrings, and replace just those, leaving other elements alone:
from bs4 import BeautifulSoup, NavigableString
html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""
soup = BeautifulSoup(html, "html.parser")
for quote in soup.find_all("blockquote"):
    for element in list(quote.descendants):
        if type(element) is NavigableString:
            markup = element.string.replace("\n", "<br>")
            element.string.replace_with(BeautifulSoup(markup, "html.parser"))
print(str(soup))
Output:
<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>
An advantage of this approach is that it doesn't touch, for example, HTML comments:
<blockquote>
<!--
a comment
-->
</blockquote>
is turned into
<blockquote><br/><!--
a comment
--><br/></blockquote>
as you might expect.
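The comment case can be checked with a small sketch of the same approach wrapped in a function. The name newlines_to_br is hypothetical, and the extra "\n" in node test is an addition that skips text nodes with nothing to replace:

```python
from bs4 import BeautifulSoup, NavigableString


def newlines_to_br(html: str) -> str:
    """Replace newlines in blockquote text nodes with <br> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for quote in soup.find_all("blockquote"):
        # Copy the descendants first, because the tree is modified while looping.
        for node in list(quote.descendants):
            # The exact type check skips NavigableString subclasses such as Comment.
            if type(node) is NavigableString and "\n" in node:
                markup = str(node).replace("\n", "<br>")
                node.replace_with(BeautifulSoup(markup, "html.parser"))
    return str(soup)


result = newlines_to_br("<blockquote>\n<!--\na comment\n-->\n</blockquote>")
print(result)
```

The exact type check is what does the work here: isinstance(node, NavigableString) would also match comments, so their newlines would be rewritten too.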
CodePudding user response:
It is much simpler to select your elements more specifically and work on each element itself as a string, using replace().
This way you don't have to worry about nested tags, which are present as objects in the tree and are not represented in the string returned by get_text().
new_content = BeautifulSoup(
    str(element).replace('\n','<br>'),
    "html.parser",
)
element.replace_with(new_content)
Example
from bs4 import BeautifulSoup
html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""
soup = BeautifulSoup(html, "html.parser")
for element in soup.find_all('blockquote'):
    new_content = BeautifulSoup(
        str(element).replace('\n','<br>'),
        "html.parser",
    )
    element.replace_with(new_content)
print(str(soup))
Output
<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>
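One trade-off to be aware of, shown as a sketch: because replace() runs on the serialised tag, it rewrites every newline inside the <blockquote>, including those inside an HTML comment. The input below is an assumed test case, a blockquote that contains only a comment:

```python
from bs4 import BeautifulSoup

html = "<blockquote>\n<!--\na comment\n-->\n</blockquote>"
soup = BeautifulSoup(html, "html.parser")
for element in soup.find_all("blockquote"):
    # str(element) is the whole tag as markup, so the newlines inside the
    # comment are replaced along with the ones in the text.
    new_content = BeautifulSoup(str(element).replace("\n", "<br>"), "html.parser")
    element.replace_with(new_content)

result = str(soup)
print(result)
```

If the blockquotes never contain comments or other newline-sensitive content, this makes no practical difference.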