I have some text with some inline span elements (icons):
<p>
<span class="icon icon-water"></span>
Some text...
<span class="icon icon-steel"></span>
some more text.
</p>
I need to write a function that gets me the text of this paragraph, but with the icons converted to text:
<span class="icon icon-water"></span> => '{water}'
<span class="icon icon-steel"></span> => '{steel}'
Resulting string should be like:
{water} Some text... {steel} some more text.
How can I do this with Python/bs4?
CodePudding user response:
Looks like what I was looking for is .contents, which outputs:
[<span class="icon icon-water"></span>,
'Some text...',
<span class="icon icon-steel"></span>,
'some more text...']
Which gets me in the ballpark of what I'm looking for. Now I can just loop over the elements and apply my desired transformation.
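A minimal sketch of that loop, assuming the spans carry classes such as "icon icon-water" (the class names, the helper name paragraph_text and the 'html.parser' choice are illustrative assumptions, not something confirmed above):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumed markup: the class names are a guess at the naming convention.
html = '''
<p>
<span class="icon icon-water"></span>
Some text...
<span class="icon icon-steel"></span>
some more text.
</p>
'''

def paragraph_text(p):
    """Return the text of p with icon spans rendered as '{name}'."""
    parts = []
    for child in p.contents:  # direct children only, in document order
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag) and child.name == 'span':
            # Take the part after 'icon-' from the first matching class.
            names = [c[len('icon-'):] for c in child.get('class', [])
                     if c.startswith('icon-')]
            if names:
                parts.append('{' + names[0] + '}')
    # Collapse the newlines and repeated spaces left over from the markup.
    return ' '.join(' '.join(parts).split())

soup = BeautifulSoup(html, 'html.parser')
result = paragraph_text(soup.p)
print(result)
```

Because .contents only yields direct children, text nested inside other tags would be skipped by this version.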
CodePudding user response:
There are several ways to accomplish this. I'll be using the following sample HTML and soup to demonstrate each of the methods:
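from bs4 import BeautifulSoup

sampleHtml = '''
<p>
<span class="icon icon-water"></span>
Some text...
<span class="icon icon-steel"></span>
some more text.
</p>
<p>
Para 2 text...
<span class="icon icon-fire">span text</span>
more Para 2 text.
</p>
'''
# the lxml parser adds the <html><body> wrapper seen in the printed soup
soup = BeautifulSoup(sampleHtml, 'lxml')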
If you're sure the span tags will be empty, or if you want to overwrite the text inside them, then you can set the .string property or use the .replace_with method to replace the span tags entirely.
for p in soup.find_all('p'):
    for s in p.select('span.icon[class*="icon-"]'):
        # first class of the form 'icon-<name>'
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        s.string = f" {{{ci.replace('icon-', '')}}} " ## STRING METHOD
        # s.replace_with(f" {{{ci.replace('icon-', '')}}} ") ## REPLACE METHOD
    print(p.get_text(' ', strip=True))
print('\n\n########################### NEW SOUP ###########################\n')
print(soup)
printed output with the first method (setting .string):
{water} Some text... {steel} some more text.
Para 2 text... {fire} more Para 2 text.
########################### NEW SOUP ###########################
<html><body><p>
<span class="icon icon-water"> {water} </span>
Some text...
<span class="icon icon-steel"> {steel} </span>
some more text.
</p>
<p>
Para 2 text...
<span class="icon icon-fire"> {fire} </span>
more Para 2 text.
</p>
</body></html>
printed output with the replace method:
{water} Some text... {steel} some more text.
Para 2 text... {fire} more Para 2 text.
########################### NEW SOUP ###########################
<html><body><p>
{water}
Some text...
{steel}
some more text.
</p>
<p>
Para 2 text...
{fire}
more Para 2 text.
</p>
</body></html>
If the span tags might contain text and you want to preserve that text, you can use one of the .insert methods or the .append method:
for p in soup.find_all('p'):
    for s in p.select('span.icon[class*="icon-"]'):
        # first class of the form 'icon-<name>'
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        # s.insert_before(f" {{{ci.replace('icon-', '')}}} ")
        # s.insert_after(f" {{{ci.replace('icon-', '')}}} ")
        # s.insert(0, f" {{{ci.replace('icon-', '')}}} ")
        s.append(f" {{{ci.replace('icon-', '')}}} ")
    print(p.get_text(' ', strip=True))
print('\n\n########################### NEW SOUP ###########################\n')
print(soup)
printed output (with the append method):
{water} Some text... {steel} some more text.
Para 2 text... span text {fire} more Para 2 text.
########################### NEW SOUP ###########################
<html><body><p>
<span class="icon icon-water"> {water} </span>
Some text...
<span class="icon icon-steel"> {steel} </span>
some more text.
</p>
<p>
Para 2 text...
<span class="icon icon-fire">span text {fire} </span>
more Para 2 text.
</p>
</body></html>
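The four insertion methods differ only in where the new string lands relative to the span. A small sketch to compare them, assuming a one-line paragraph and the 'html.parser' parser (the HTML constant and the apply helper are made up for illustration):

```python
from bs4 import BeautifulSoup

# Assumed one-line sample; the class names are illustrative.
HTML = '<p>a <span class="icon icon-fire">span text</span> b</p>'

def apply(method_name):
    """Parse a fresh copy of HTML, call one method on the span, return the <p> markup."""
    soup = BeautifulSoup(HTML, 'html.parser')
    span = soup.span
    label = '{fire}'
    if method_name == 'insert':
        span.insert(0, label)  # insert takes a position first
    else:
        getattr(span, method_name)(label)
    return str(soup.p)

results = {name: apply(name)
           for name in ('insert_before', 'insert_after', 'insert', 'append')}
for name, markup in results.items():
    print(f'{name:14} {markup}')
```

insert_before and insert_after put the string next to the span as a sibling, while insert(0, ...) and append put it inside the span, before or after any text it already has.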
If you want to avoid altering soup, then you could define a function like:
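def icon_or_str(pdesc):
    # text nodes are returned as they are
    if 'NavigableString' in str(type(pdesc)): return str(pdesc)
    # tags other than span contribute nothing themselves
    if getattr(pdesc, 'name') != 'span': return ''
    pdc = pdesc.get('class', [])
    pci = [c for c in pdc if 'icon-' in c]
    # only spans with both 'icon' and an 'icon-<name>' class are converted
    if not (pci and 'icon' in pdc) : return ''
    return f" {{{pci[0].replace('icon-', '')}}} "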
and then use it as below:
for p in soup.find_all('p'):
    pTxt = ' '.join([icon_or_str(d) for d in p.descendants])
    print(' '.join(w for w in pTxt.split() if w))
    ## [remove excess whitespace by splitting and re-joining "words"]
printed output:
{water} Some text... {steel} some more text.
Para 2 text... {fire} span text more Para 2 text.
(Personally, I feel that this last method is a bit cumbersome compared to the others.)
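Another option that leaves soup untouched might be to run the replace method on a copy of each paragraph. This is only a sketch: the function name text_with_icons, the sample class names and the 'html.parser' choice are my own assumptions.

```python
import copy
from bs4 import BeautifulSoup

def text_with_icons(p):
    """Return p's text with icon spans shown as '{name}'; p itself is not modified."""
    clone = copy.copy(p)  # independent copy of the tag and everything inside it
    for s in clone.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if c.startswith('icon-')][0]
        label = ci[len('icon-'):]
        s.replace_with(f" {{{label}}} ")  # any text inside the span is dropped
    return clone.get_text(' ', strip=True)

# Assumed markup: the class names are illustrative.
html = '''
<p>
<span class="icon icon-water"></span>
Some text...
<span class="icon icon-steel"></span>
some more text.
</p>
'''
soup = BeautifulSoup(html, 'html.parser')
before = str(soup)
result = text_with_icons(soup.p)
print(result)
print(str(soup) == before)  # the original tree is unchanged
```

As with the replace method, any text already inside a span is discarded; swap replace_with for append on the copy if that text should be kept.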