How to wrap a new tag around multiple tags with BeautifulSoup?

Time:10-01

In the example below, I am trying to wrap a <content> tag around all the <p> tags in a section. Each section is within an <item>, but the <title> needs to stay outside of the <content>. How can I do this?

Source file:

<item>
<title>Heading for Sec 1</title>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
</item>

<item>
<title>Heading for Sec 2</title>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
</item>

<item>
<title>Heading for Sec 3</title>
    <p>some text sec 3</p>
    <p>some text sec 3</p>
</item>

I want this output:

<item>
<title>Heading for Sec 1</title>
    <content>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    </content>
</item>

<item>
<title>Heading for Sec 2</title>
    <content>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    </content>
</item>

<item>
<title>Heading for Sec 3</title>
    <content>
    <p>some text sec 3</p>
    <p>some text sec 3</p>
    </content>
</item>

Below is the code I am trying. However, it wraps a separate <content> tag around every <p> tag, instead of a single <content> tag around all the <p> tags in a section. How can I fix this?

from bs4 import BeautifulSoup
with open('testdoc.txt', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")

content = None
for tag in soup.select("p"):  
    if tag.name == "p":
        content = tag.wrap(soup.new_tag("content"))
        content.append(tag)
        continue

print(soup)

CodePudding user response:

Try:

from bs4 import BeautifulSoup

html_doc = """\
<item>
<title>Heading for Sec 1</title>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
</item>

<item>
<title>Heading for Sec 2</title>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
</item>

<item>
<title>Heading for Sec 3</title>
    <p>some text sec 3</p>
    <p>some text sec 3</p>
</item>"""


soup = BeautifulSoup(html_doc, "html.parser")

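for item in soup.select("item"):
    # Create one <content> wrapper per <item> and place it right after <title>.
    content = soup.new_tag("content")
    content.append("\n")
    item.title.insert_after(content)
    item.title.insert_after("\n")

    # Move every <p> of this section into the wrapper, one per line.
    for p in item.select("p"):
        content.append(p)
        content.append("\n")

    # Merge the leftover whitespace strings, then reduce each to a single newline.
    item.smooth()
    for text_node in item.find_all(string=True, recursive=False):
        text_node.replace_with("\n")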

print(soup)

Prints:

<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>

<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>

<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
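
An alternative that stays closer to the wrap() approach from the question: wrap only the first <p> of each <item>, then move the remaining <p> tags into that same wrapper. This is a minimal sketch, not a tested drop-in. The function name wrap_paragraphs, its wrapper_name parameter, and the shortened sample input are assumptions made for illustration, and it assumes the <p> tags are direct children of <item>.

```python
from bs4 import BeautifulSoup

# Shortened sample input (an assumption for illustration; same shape as the source file).
html_doc = """\
<item>
<title>Heading for Sec 1</title>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
</item>

<item>
<title>Heading for Sec 2</title>
    <p>some text sec 2</p>
</item>"""


def wrap_paragraphs(soup, wrapper_name="content"):
    """Wrap all direct <p> children of each <item> in one new wrapper tag."""
    for item in soup.find_all("item"):
        paragraphs = item.find_all("p", recursive=False)
        if not paragraphs:
            continue  # nothing to wrap in this section
        # wrap() replaces the first <p> with the new tag and returns that tag.
        wrapper = paragraphs[0].wrap(soup.new_tag(wrapper_name))
        # append() moves each remaining <p> out of <item> and into the wrapper.
        for p in paragraphs[1:]:
            wrapper.append(p)
    return soup


soup = wrap_paragraphs(BeautifulSoup(html_doc, "html.parser"))
print(soup)
```

This version does not touch the whitespace between tags, so the indentation of the printed result will differ from the output above. If tidy line breaks matter, the smooth() and replace_with() steps from the answer can be applied afterwards.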