Using BeautifulSoup, how to select a tag without its children?

The html is as follows:

<body>
    <div name='tag-i-want'>
        <span>I don't want this</span>
    </div>
</body>

I'm trying to get all the divs and convert them to strings:

divs = [str(i) for i in soup.find_all('div')]

However, they'll have their children too:

>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]

What I'd like it to be is:

>>> ["<div name='tag-i-want'></div>"]

I looked at unwrap(), but it modifies the soup; I'd like the soup to remain untouched.

CodePudding user response：

clear() removes a tag's contents. To leave the soup unaltered, you can either make a copy of the tag with copy.copy() and clear the copy, or use a DIY approach. Here is an example with copy:

from bs4 import BeautifulSoup
import copy

html = """<body>
    <div name='tag-i-want'>
        <span>I don't want this</span>
    </div>
</body>"""

soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')

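# copy.copy() returns a detached copy of the tag, so clearing it leaves the soup intact
div_only = copy.copy(div)
div_only.clear()

print(div_only)
# The <span> is still present in the original soup
print(soup.find_all('span') != [])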

Output

<div name="tag-i-want"></div>
True

Remark: the DIY approach, without copy. There are two options.

Use the Tag class:

from bs4 import BeautifulSoup, Tag
...
# dict(...) makes sure the new tag gets its own attribute dict rather than sharing div's
div_only = Tag(name=div.name, attrs=dict(div.attrs))

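Or build the string yourself:

# Note: list-valued attributes (e.g. class) are not joined here and values are not escaped
attrs = ' '.join(f'{key}="{value}"' for key, value in div.attrs.items())
div_only = f'<div {attrs}></div>'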
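The string-building approach above works for the sample HTML, but it may misbehave in two cases: BeautifulSoup parses multi-valued attributes such as class into lists, and attribute values containing quotes are not escaped. A minimal sketch that handles both is shown below. The helper name empty_tag_string is hypothetical, and html.parser is used only so the sketch does not depend on lxml.

```python
from html import escape

from bs4 import BeautifulSoup


def empty_tag_string(tag):
    """Render a tag's opening and closing parts, with attributes, but no children."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class are stored as lists of strings.
        if isinstance(value, list):
            value = ' '.join(value)
        # Escape &, <, > and quotes so the attribute value stays well-formed.
        parts.append(f'{key}="{escape(str(value), quote=True)}"')
    return f"<{' '.join(parts)}></{tag.name}>"


soup = BeautifulSoup(
    '<div class="a b" name="tag-i-want"><span>I don\'t want this</span></div>',
    'html.parser',
)
print(empty_tag_string(soup.div))  # <div class="a b" name="tag-i-want"></div>
```

Because the helper only reads tag.name and tag.attrs, the soup itself is left untouched.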

CodePudding user response：

@cards pointed me in the right direction with copy(). This is what I ended up using:

from bs4 import BeautifulSoup
import copy

html = """<body>
    <div name='tag-i-want'>
        <span>I don't want this</span>
    </div>
</body>"""

soup = BeautifulSoup(html, 'lxml')

def remove_children(tag):
    tag.clear()
    return tag

divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]
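As a possible alternative, copying can be avoided by building an empty tag with soup.new_tag(), which creates a tag that is not attached to the tree. The sketch below is not from either answer: empty_shell is a hypothetical helper name, it assumes a bs4 version whose new_tag() accepts the attrs keyword, and it uses html.parser only to avoid depending on lxml. Passing attrs as a dict matters here because the div has an attribute called name, which would otherwise clash with new_tag()'s own name parameter.

```python
from bs4 import BeautifulSoup

html = """<body>
    <div name='tag-i-want'>
        <span>I don't want this</span>
    </div>
</body>"""

soup = BeautifulSoup(html, 'html.parser')
before = str(soup)  # snapshot used to check that the soup is not modified


def empty_shell(soup, tag):
    """Return a new, detached tag with the same name and attributes but no children."""
    # dict(...) makes a shallow copy, so the new tag does not share tag's attribute dict.
    return soup.new_tag(tag.name, attrs=dict(tag.attrs))


divs = [str(empty_shell(soup, d)) for d in soup.find_all('div')]
print(divs)  # ['<div name="tag-i-want"></div>']
print(str(soup) == before)  # True
```

Since the children are never copied, this may be cheaper than copy.copy() followed by clear() when the divs contain large subtrees.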