I've been stuck on the same problem for a day and a half now and nothing seems to work. I am parsing HTML files and extracting paragraphs of text. However, some pages are structured like this:
<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>
My desired output is this:
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
I tried the BS4 replace_with function, but it doesn't seem to be working, as I get this error: TypeError: 'NoneType' object is not callable:
from bs4 import BeautifulSoup

html = "<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>"
soup = BeautifulSoup(html, "html.parser")
allparas = soup.find_all('p')  # In the actual files there is more code
for p in allparas:
    if p.find_all(["br", "br/"]):  # Some files don't have br tags
        for br in p.find_all(["br", "br/"]):
            new_p = br.new_tag('p', closed=True)
            br.replace_with(new_p)
The closest I've gotten is by replacing the tag with a string, but something seems to be going wrong with the encoding:
if html.find_all(["br", "br/"]):
    for br in html.find_all(["br", "br/"]):
        br.replace_with("</p><p>")
reslist = [p for p in html.find_all("p")]
allparas = ''.join(str(p) for p in reslist)  # Overwriting allparas here as I need it later
This works, but my print output is as follows:
<p>First paragraph.&lt;/p&gt;&lt;p&gt;Second paragraph.&lt;/p&gt;&lt;p&gt;Third paragraph.</p>
Something is going wrong with converting the string to a BS4 tag. Any help would be immensely appreciated!
CodePudding user response:
I would do it with CSS selectors (just a personal preference). In any case, based exclusively on your sample HTML, you can do something like this:
for s in list(soup.strings):
    # wrap the text segments with a new tag
    s.wrap(soup.new_tag("p"))
for br in soup.select('br'):
    # remove the original br tags
    br.extract()
soup
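The output should be close to your expected output: each text segment ends up in its own <p>, although the new tags are still nested inside the original <p>.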
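For reference, here is a self-contained sketch of the snippet above, run against the sample HTML from the question. The final unwrap() step and the names outer_paragraphs and result are my additions, not part of the original answer; unwrap() removes the original outer <p> so that only the new paragraphs remain.

```python
from bs4 import BeautifulSoup

html = "<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>"
soup = BeautifulSoup(html, "html.parser")

# Remember the original paragraphs before any new <p> tags are created.
outer_paragraphs = soup.find_all("p")

for s in list(soup.strings):
    # wrap each text segment in its own new <p> tag
    s.wrap(soup.new_tag("p"))

for br in soup.select("br"):
    # remove the original <br> tags
    br.extract()

for outer in outer_paragraphs:
    # assumption: the original <p> is no longer wanted, so replace it
    # with its children (the newly created paragraphs)
    outer.unwrap()

result = str(soup)
print(result)
```

Note that this wraps every string in the document, so on real pages whitespace-only strings between tags would also get their own <p>. You would probably want to filter those out first.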
CodePudding user response:
Regular function
Here is an implementation that handles arbitrary sibling tags of those <br> tags (not just strings):
from bs4 import BeautifulSoup, Tag


def breaks_to_paragraphs(
    tag: Tag,
    soup: BeautifulSoup,
    recursive: bool = False,
) -> None:
    """
    If `tag` contains `<br>` elements, it is split into `<p>` tags instead.

    The `<br>` tags are removed from `tag`.
    If no `<br>` tags are found, this function does nothing.

    Args:
        tag:
            The `Tag` instance to mutate
        soup:
            The `BeautifulSoup` instance the tag belongs to (for `new_tag`)
        recursive (optional):
            If `True`, the function is applied to all nested tags recursively;
            otherwise (default) only the children are affected.
    """
    elements = []
    contains_br = False
    for child in list(tag.children):
        if isinstance(child, Tag) and child.name != "br":
            if recursive:
                breaks_to_paragraphs(child, soup, recursive=recursive)
            elements.append(child)
        elif not isinstance(child, Tag):  # it is a `NavigableString`
            elements.append(child)
        else:  # it is a `<br>` tag
            contains_br = True
            p = soup.new_tag("p")
            child.replace_with(p)
            p.extend(elements)
            elements.clear()
    if elements and contains_br:
        p = soup.new_tag("p")
        tag.append(p)
        p.extend(elements)
    soup.smooth()
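The function takes soup as an argument because new_tag is a method of the BeautifulSoup object, not of individual tags. That is most likely also where the TypeError in the question comes from: attribute access on a Tag is shorthand for find(), so br.new_tag looks for a child element named new_tag and evaluates to None. A minimal sketch illustrating this (the sample HTML and variable names are just for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>First paragraph.<br/>Second paragraph.</p>", "html.parser")
br = soup.find("br")

# Attribute access on a Tag searches for a child tag with that name.
# There is no <new_tag> element inside <br>, so this is None.
lookup = br.new_tag
print(lookup)

error_message = ""
try:
    br.new_tag("p")  # effectively None("p")
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# new_tag has to be called on the BeautifulSoup object instead.
new_p = soup.new_tag("p")
new_p.string = "Second paragraph."
print(new_p)
```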
Subclass method
Alternatively, since you need the original BeautifulSoup instance for calling the new_tag method, you can also subclass it and implement this as a method instead:
from bs4 import BeautifulSoup, Tag


class CustomSoup(BeautifulSoup):
    def breaks_to_paragraphs(self, tag: Tag, recursive: bool = False) -> None:
        """
        If `tag` contains `<br>` elements, it is split into `<p>` tags instead.

        The `<br>` tags are removed from `tag`.
        If no `<br>` tags are found, this method does nothing.

        Args:
            tag:
                The `Tag` instance to mutate
            recursive (optional):
                If `True`, the function is applied to all nested tags recursively;
                otherwise (default) only the children are affected.
        """
        elements = []
        contains_br = False
        for child in list(tag.children):
            if isinstance(child, Tag) and child.name != "br":
                if recursive:
                    self.breaks_to_paragraphs(child, recursive=recursive)
                elements.append(child)
            elif not isinstance(child, Tag):  # it is a `NavigableString`
                elements.append(child)
            else:  # it is a `<br>` tag
                contains_br = True
                p = self.new_tag("p")
                child.replace_with(p)
                p.extend(elements)
                elements.clear()
        if elements and contains_br:
            p = self.new_tag("p")
            tag.append(p)
            p.extend(elements)
        self.smooth()
Demo
Here is a quick test:
...

def main() -> None:
    html = """
        <p>
            First paragraph. <br/>
            Second paragraph.<br/>
            <span>foo</span>
            <span>bar<br>baz</span>
        </p>
    """
    soup = CustomSoup(html, "html.parser")
    soup.breaks_to_paragraphs(soup.p)
    print(soup.p.prettify())


if __name__ == "__main__":
    main()
Output:
<p>
 <p>
  First paragraph.
 </p>
 <p>
  Second paragraph.
 </p>
 <p>
  <span>
   foo
  </span>
  <span>
   bar
   <br/>
   baz
  </span>
 </p>
</p>
If you call it with soup.breaks_to_paragraphs(soup.p, recursive=True) instead:
<p>
 <p>
  First paragraph.
 </p>
 <p>
  Second paragraph.
 </p>
 <p>
  <span>
   foo
  </span>
  <span>
   <p>
    bar
   </p>
   <p>
    baz
   </p>
  </span>
 </p>
</p>
Notice how it split into <p> tags along the nested <br> here as well.
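Both outputs keep the new <p> tags nested inside the original <p>, which HTML does not actually allow and which differs slightly from the flat output requested in the question. If you need sibling paragraphs instead, one possible variation is sketched below. The helper split_paragraph_on_breaks is a hypothetical name of mine, not part of bs4, and it is non-recursive: it only looks at direct children.

```python
from bs4 import BeautifulSoup, Tag


def split_paragraph_on_breaks(p: Tag, soup: BeautifulSoup) -> None:
    """Hypothetical helper: replace `p` with one sibling <p> per <br>-separated group."""
    groups = [[]]
    for child in list(p.children):
        if isinstance(child, Tag) and child.name == "br":
            groups.append([])  # a <br> starts a new group
        else:
            groups[-1].append(child)
    if len(groups) == 1:
        return  # no <br> found, leave the paragraph untouched
    for group in groups:
        if not any(str(child).strip() for child in group):
            continue  # skip empty or whitespace-only groups
        new_p = soup.new_tag("p")
        p.insert_before(new_p)  # new paragraphs become siblings of `p`
        new_p.extend(group)     # moves the children out of `p`
    p.decompose()  # only <br> tags and skipped whitespace remain in `p`


html = "<p>First paragraph. <br/>Second paragraph.<br/>Third paragraph</p>"
soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):
    split_paragraph_on_breaks(p, soup)
result = str(soup)
print(result)
```

The trailing space after the first sentence is preserved as-is. Strip the strings before moving them if that matters for your output.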