In the example below, I am trying to wrap a <content> tag around all the <p> tags in a section. Each section is within an <item>, but the <title> needs to stay outside of the <content>. How can I do this?
Source file:
<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</item>
<item>
<title>Heading for Sec 3</title>
<p>some text sec 3</p>
<p>some text sec 3</p>
</item>
I want this output:
<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
The code below is what I am trying. However, it wraps a <content> tag around every <p> tag, instead of around all the <p> tags in a section. How can I fix this?
from bs4 import BeautifulSoup

with open('testdoc.txt', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")

content = None
for tag in soup.select("p"):
    if tag.name == "p":
        content = tag.wrap(soup.new_tag("content"))
        content.append(tag)
        continue

print(soup)
CodePudding user response:
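Create one <content> tag per <item>, insert it right after the <title>, and move every <p> of that item into it. Try: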
from bs4 import BeautifulSoup
html_doc = """\
<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</item>
<item>
<title>Heading for Sec 3</title>
<p>some text sec 3</p>
<p>some text sec 3</p>
</item>"""
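soup = BeautifulSoup(html_doc, "html.parser")

for item in soup.select("item"):
    # one <content> per <item>, placed right after the <title>
    t = soup.new_tag("content")
    t.append("\n")
    item.title.insert_after(t)
    item.title.insert_after("\n")

    # appending an existing tag moves it, so each <p> ends up inside <content>
    for p in item.select("p"):
        t.append(p)
        t.append("\n")

    # merge the leftover whitespace strings, then reduce each to one newline
    item.smooth()
    for text_node in item.find_all(string=True, recursive=False):
        text_node.replace_with("\n")

print(soup)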
Prints:
<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
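As for why the code in the question misbehaves: it calls wrap() on every <p> in the loop, so each paragraph gets a fresh <content> of its own. If the exact whitespace of the output does not matter, a shorter variant is to wrap only the first <p> of each <item> and then move the remaining ones into that same wrapper. This is a minimal sketch, not the only way to do it; the function name wrap_paragraphs and the sample string are made up for illustration, and it assumes the <p> tags are direct children of <item>.

```python
from bs4 import BeautifulSoup


def wrap_paragraphs(markup):
    """Group the direct <p> children of each <item> under one <content> tag.

    Hypothetical helper for illustration; it does not tidy whitespace.
    """
    soup = BeautifulSoup(markup, "html.parser")
    for item in soup.find_all("item"):
        paragraphs = item.find_all("p", recursive=False)
        if not paragraphs:
            continue  # no <p> in this item, so nothing to wrap
        # wrap() puts the first <p> inside a new <content> and returns the wrapper
        content = paragraphs[0].wrap(soup.new_tag("content"))
        # append() moves each remaining <p> out of <item> and into <content>
        for p in paragraphs[1:]:
            content.append(p)
    return soup


# Made-up sample input, kept on one line so whitespace is not a factor
sample = (
    "<item><title>T</title><p>a</p><p>b</p></item>"
    "<item><title>U</title><p>c</p></item>"
)
result = wrap_paragraphs(sample)
print(result)
```

For the one-line sample this prints <item><title>T</title><content><p>a</p><p>b</p></content></item><item><title>U</title><content><p>c</p></content></item>. With the multi-line source file from the question, the newlines that sat between the <p> tags stay behind in <item>, which is what the smooth() and replace_with("\n") steps in the answer above clean up.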