Replacing line breaks with <br> inside a tag using BeautifulSoup

I want to parse some HTML using BeautifulSoup and replace any line breaks (\n) that are within <blockquote> tags with <br> tags. What makes this harder is that the <blockquote> may contain other HTML tags.

My current attempt:

from bs4 import BeautifulSoup

html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

for element in soup.findAll():
    if element.name == "blockquote":
        new_content = BeautifulSoup(
            "<br>".join(element.get_text(strip=True).split("\n")).strip("<br>"),
            "html.parser",
        )
        element.string.replace_with(new_content)

print(str(soup))

Output should be:

<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>

However, this code (adapted from another answer) only works if there are no HTML tags within the <blockquote>. If there are, as with <strong>Line 3</strong> here, element.string is None, so the call to element.string.replace_with() raises an AttributeError.
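The failure can be reproduced in isolation. This is a minimal sketch, assuming bs4 is installed; the variable names plain and mixed are only illustrative. It shows that .string returns the text when the tag has a single text child, and None once the tag has several children:

```python
from bs4 import BeautifulSoup

# A blockquote containing only text: .string is its single text child.
plain = BeautifulSoup("<blockquote>Line 1\nLine 2</blockquote>", "html.parser")
plain_string = plain.blockquote.string

# A blockquote mixing text and a tag has several children, so .string is None.
mixed = BeautifulSoup(
    "<blockquote>Line 1\n<strong>Line 3</strong></blockquote>", "html.parser"
)
mixed_string = mixed.blockquote.string

print(repr(plain_string))  # 'Line 1\nLine 2'
print(repr(mixed_string))  # None
```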

Is there an alternative that can cope with HTML tags?

CodePudding user response：

An alternative would be to use descendants to look for NavigableStrings, and replace just those, leaving other elements alone:

from bs4 import BeautifulSoup, NavigableString

html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

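for quote in soup.find_all("blockquote"):
    # Copy the descendants into a list first, because the loop modifies the tree.
    for element in list(quote.descendants):
        # Exact type check: subclasses such as Comment are deliberately skipped.
        if type(element) is NavigableString:
            markup = element.replace("\n", "<br>")
            element.replace_with(BeautifulSoup(markup, "html.parser"))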

print(str(soup))

Output:

<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>

An advantage of this approach is that it leaves the contents of other node types, such as HTML comments, untouched:

<blockquote>
<!--
a comment
-->
</blockquote>

is turned into

<blockquote><br/><!--
a comment
--><br/></blockquote>

as you might expect.
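One possible caveat with re-parsing each text node as HTML is that text which was escaped in the source (for example &lt;b&gt;) is held unescaped in the NavigableString, so it could be parsed as a real tag the second time around. The following is a sketch of the same traversal that builds the <br> tags with soup.new_tag() instead of re-parsing. The helper name newlines_to_br is hypothetical and not part of the answer above:

```python
from bs4 import BeautifulSoup, NavigableString


def newlines_to_br(html):
    """Return html with newlines in <blockquote> text nodes replaced by <br>."""
    soup = BeautifulSoup(html, "html.parser")
    for quote in soup.find_all("blockquote"):
        # Snapshot the descendants, because the loop modifies the tree.
        for node in list(quote.descendants):
            # Exact type check: Comment subclasses NavigableString and is skipped.
            if type(node) is not NavigableString or "\n" not in node:
                continue
            # Insert the text pieces before the old node, separated by <br> tags.
            for i, part in enumerate(node.split("\n")):
                if i > 0:
                    node.insert_before(soup.new_tag("br"))
                if part:
                    node.insert_before(NavigableString(part))
            # Remove the original text node now that its pieces are in place.
            node.extract()
    return str(soup)


print(newlines_to_br("<blockquote>Line 1\nLine 2\n<strong>Line 3</strong>\nLine 4</blockquote>"))
print(newlines_to_br("<blockquote>a &lt;b&gt;\nc</blockquote>"))
```

With this variant the escaped text in the second call stays text, and comments are still skipped because of the exact type check.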

CodePudding user response：

It is much simpler to select your elements more specifically and work on each element as a string, using replace().

This way you don't have to worry about nested tags, which exist as separate objects in the tree and are not represented in the string returned by get_text().

new_content = BeautifulSoup(
    str(element).replace("\n", "<br>"),
    "html.parser",
)
element.replace_with(new_content)

Example

from bs4 import BeautifulSoup

html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

for element in soup.find_all('blockquote'):
    new_content = BeautifulSoup(
        str(element).replace("\n", "<br>"),
        "html.parser",
    )
    element.replace_with(new_content)

print(str(soup))

Output

<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>
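One trade-off worth checking before relying on this approach: because it works on the serialized element, it replaces every newline inside the <blockquote>, including those inside HTML comments. A small sketch of that behaviour, using a made-up input containing a comment:

```python
from bs4 import BeautifulSoup

html = "<blockquote>\n<!--\na comment\n-->\n</blockquote>"
soup = BeautifulSoup(html, "html.parser")

for element in soup.find_all("blockquote"):
    # Serialize the whole element, swap the newlines, and parse it again.
    new_content = BeautifulSoup(
        str(element).replace("\n", "<br>"),
        "html.parser",
    )
    element.replace_with(new_content)

result = str(soup)
print(result)
# <blockquote><br/><!--<br>a comment<br>--><br/></blockquote>
```

If the comment text needs to stay as it was, filtering on text nodes rather than on the serialized string avoids this.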