BeautifulSoup: Search and replace in the text parts of HTML

Time:10-31

I want to do a search and replace on the textual part of the content of the HTML elements.

E.g., replacing foo with <b>bar</b> in

<div id="foo">foo <i>foo</i> hi foo hi</div>

should result in

<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>

I already have a working version in Perl, but the HTML parser there is buggy:

#!/usr/bin/env perl
##
use strict;
use warnings;
use v5.34.0;

use Mojo::DOM;
##
my $input = do { local $/; <STDIN> };

my $dom = Mojo::DOM->new($input);

$dom->descendant_nodes->grep(sub { $_->type eq 'text' })
    ->each(sub{
        $_->replace(s/(sth)/<span >$1<\/span>/gr)
           });

say $dom;

CodePudding user response:

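Using string manipulation functions such as .replace or regular expressions on HTML strings is not recommended, but since you asked for a solution along those lines, here is one. Ideally this should be done with BeautifulSoup. Note that the snippet below relies on the first foo in the input being the one inside the id attribute.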

html = """<div id="foo">foo <i>foo</i> hi foo hi</div>"""
res = html.replace("foo", "<b>bar</b>").replace("<b>bar</b>", "foo", 1)
print(res)

Output:

<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>
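To illustrate how fragile the string-replacement approach is, here is a small sketch that runs the same two-step replace on a different input. The helper name naive_replace and the second HTML string are made up for this illustration; they are not from the question.

```python
def naive_replace(html):
    # Same idea as the snippet above: replace every "foo", then undo the
    # first replacement, assuming it was the one inside the id attribute.
    return html.replace("foo", "<b>bar</b>").replace("<b>bar</b>", "foo", 1)


# Hypothetical input where a text "foo" appears before the attribute "foo".
html = '<p>foo</p><div id="foo">foo</div>'
result = naive_replace(html)
print(result)
# <p>foo</p><div id="<b>bar</b>"><b>bar</b></div>
```

Here the text inside the p element is left unchanged and the id attribute is corrupted, which is why working on parsed text nodes is the safer route.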

CodePudding user response:

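  1. Find all text nodes containing foo
  2. Split each text node around the matches
  3. Insert a new b element containing bar for each match, keeping the remaining text as plain text nodes
  4. Remove the original text node

from bs4 import BeautifulSoup, NavigableString
import re

html_string = '''
<div id="foo">foo <i>foo</i> hi foo hi</div>
'''

soup = BeautifulSoup(html_string, "html.parser")

# find_all returns a list, so the tree can be modified safely inside the loop
for node in soup.find_all(string=re.compile("foo")):
    # splitting on a capturing group keeps the matches in the result
    for part in re.split("(foo)", str(node)):
        if part == "foo":
            bold = soup.new_tag("b")
            bold.string = "bar"
            node.insert_before(bold)
        elif part:
            node.insert_before(NavigableString(part))
    # drop the original text node now that its pieces are in place
    node.extract()

print(soup)

Output:

<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>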
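
The Perl snippet in the question wraps each regex match in an element rather than swapping in fixed text. A rough sketch of the same idea with BeautifulSoup follows. The function name wrap_matches and its parameters are invented for this example, and it assumes the pattern never matches the empty string.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString


def wrap_matches(soup, pattern, tag_name):
    """Wrap every regex match found in text nodes in a new <tag_name> element."""
    regex = re.compile(pattern)
    # find_all returns a list, so the tree can be modified inside the loop.
    for node in soup.find_all(string=regex):
        if isinstance(node, Comment):
            continue  # leave HTML comments untouched
        text = str(node)
        pos = 0
        for match in regex.finditer(text):
            # Keep the plain text between the previous match and this one.
            if match.start() > pos:
                node.insert_before(NavigableString(text[pos:match.start()]))
            wrapper = soup.new_tag(tag_name)
            wrapper.string = match.group(0)
            node.insert_before(wrapper)
            pos = match.end()
        # Keep whatever text follows the last match.
        if pos < len(text):
            node.insert_before(NavigableString(text[pos:]))
        # The original node has been rebuilt piece by piece; remove it.
        node.extract()
    return soup


soup = BeautifulSoup('<div id="foo">foo <i>foo</i> hi foo hi</div>', "html.parser")
result = str(wrap_matches(soup, r"foo", "span"))
print(result)
# <div id="foo"><span>foo</span> <i><span>foo</span></i> hi <span>foo</span> hi</div>
```

Because only text nodes are visited, the id="foo" attribute is never touched.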