Parsing inline elements of a paragraph with BeautifulSoup & Python

I have some text with inline span elements (icons):

<p>
  <span class="icon icon-water"></span>
  Some text...
  <span class="icon icon-steel"></span>
  some more text.
</p>

I need to write a function that returns the text of this paragraph, with the icons converted to text:

<span class="icon icon-water"></span> => '{water}'
<span class="icon icon-steel"></span> => '{steel}'

The resulting string should look like this:

{water} Some text... {steel} some more text.

How can I do this with Python/bs4?

CodePudding user response:

It looks like what I was looking for is .contents, which outputs (whitespace omitted):

[<span class="icon icon-water"></span>,
 'Some text...',
 <span class="icon icon-steel"></span>,
 'some more text.']

That gets me in the ballpark of what I need. Now I can just loop over the elements and apply the desired transformation to each one.
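
The loop over .contents could be sketched roughly as follows. This is only a minimal sketch: it assumes the icon spans carry classes of the form `icon icon-water`, and the helper name `paragraph_text` is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '''
<p>
  <span class="icon icon-water"></span>
  Some text...
  <span class="icon icon-steel"></span>
  some more text.
</p>
'''

def paragraph_text(p):
    """Join the direct children of p, turning icon spans into '{name}'."""
    parts = []
    for child in p.contents:
        if isinstance(child, NavigableString):
            # Plain text nodes are kept as they are
            parts.append(str(child))
        elif isinstance(child, Tag) and child.name == 'span':
            # Assumed convention: the icon name follows the 'icon-' prefix
            names = [c[len('icon-'):] for c in child.get('class', [])
                     if c.startswith('icon-')]
            if names:
                parts.append('{' + names[0] + '}')
    # Collapse the whitespace left over from the HTML indentation
    return ' '.join(' '.join(parts).split())

soup = BeautifulSoup(html, 'html.parser')
result = paragraph_text(soup.p)
print(result)
```

Note that .contents only yields direct children, so any text nested inside a span (or inside another tag) is dropped by this sketch.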

CodePudding user response:

There are several ways to accomplish this. I'll be using the following sample HTML and soup to demonstrate each of the methods:

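from bs4 import BeautifulSoup

sampleHtml = '''
<p>
  <span class="icon icon-water"></span>
  Some text...
  <span class="icon icon-steel"></span>
  some more text.
</p>
<p>
  Para 2 text...
  <span class="icon icon-fire">span text</span>
  more Para 2 text.
</p>
'''
soup = BeautifulSoup(sampleHtml, 'lxml')  # 'lxml' matches the <html><body> wrapper in the printed output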

If you're sure the span tags will be empty, or if you want to overwrite any text inside them, you can either set the .string property (which keeps the span tag) or use the .replace_with method (which replaces the span tag entirely).

for p in soup.find_all('p'):
    for s in p.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        s.string = f" {{{ci.replace('icon-', '')}}} " ## STRING METHOD 
        # s.replace_with(f" {{{ci.replace('icon-', '')}}} ") ## REPLACE METHOD 
    print(p.get_text(' ', strip=True))

print('\n\n########################### NEW SOUP ###########################\n')
print(soup)

Printed output with the first method (setting .string):

{water} Some text... {steel} some more text.
Para 2 text... {fire} more Para 2 text.


########################### NEW SOUP ###########################

<html><body><p>
<span class="icon icon-water"> {water} </span>
  Some text...
  <span class="icon icon-steel"> {steel} </span>
  some more text.
</p>
<p>
  Para 2 text...
  <span class="icon icon-fire"> {fire} </span>
  more Para 2 text.
</p>
</body></html>

Printed output with the replace method (the span tags are gone from the soup):

{water} Some text... {steel} some more text.
Para 2 text... {fire} more Para 2 text.


########################### NEW SOUP ###########################

<html><body><p>
 {water} 
  Some text...
   {steel} 
  some more text.
</p>
<p>
  Para 2 text...
   {fire} 
  more Para 2 text.
</p>
</body></html>

If the span tags might contain text and you want to preserve that text, you can use one of the insert methods (.insert_before, .insert_after, .insert) or the .append method:

for p in soup.find_all('p'):
    for s in p.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        # s.insert_before(f" {{{ci.replace('icon-', '')}}} ")
        # s.insert_after(f" {{{ci.replace('icon-', '')}}} ")
        # s.insert(0, f" {{{ci.replace('icon-', '')}}} ")
        s.append(f" {{{ci.replace('icon-', '')}}} ")
    print(p.get_text(' ', strip=True))

print('\n\n########################### NEW SOUP ###########################\n')
print(soup)

Printed output (with the append method):

{water} Some text... {steel} some more text.
Para 2 text... span text {fire} more Para 2 text.


########################### NEW SOUP ###########################

<html><body><p>
<span class="icon icon-water"> {water} </span>
  Some text...
  <span class="icon icon-steel"> {steel} </span>
  some more text.
</p>
<p>
  Para 2 text...
  <span class="icon icon-fire">span text {fire} </span>
  more Para 2 text.
</p>
</body></html>

If you want to avoid altering soup, you could define a function like this:

from bs4 import NavigableString

def icon_or_str(pdesc):
    # Text nodes pass through unchanged
    if isinstance(pdesc, NavigableString):
        return str(pdesc)
    # Tags other than <span> contribute nothing themselves
    if pdesc.name != 'span':
        return ''
    pdc = pdesc.get('class', [])
    pci = [c for c in pdc if 'icon-' in c]
    if not (pci and 'icon' in pdc):
        return ''
    return f" {{{pci[0].replace('icon-', '')}}} "

and then use it as below:

for p in soup.find_all('p'):
    pTxt = ' '.join(icon_or_str(d) for d in p.descendants)
    # Remove excess whitespace by splitting and re-joining "words"
    print(' '.join(pTxt.split()))

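Printed output (the icon text comes before the span's own text, because .descendants yields the span tag before its children):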

{water} Some text... {steel} some more text.
Para 2 text... {fire} span text more Para 2 text.

(Personally, I feel that this last method is a bit cumbersome compared to the others.)
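
A possibly less cumbersome way to leave soup untouched is to run the replace method on a copy of each paragraph. The following is a minimal sketch under that assumption: the function name `text_with_icons` is hypothetical, the `icon icon-fire` class names are assumed, and it relies on `copy.copy()` producing a detached copy of a tag, which Beautiful Soup supports.

```python
import copy
from bs4 import BeautifulSoup

def text_with_icons(p):
    """Return p's text with icon spans shown as '{name}', leaving p unchanged."""
    clone = copy.copy(p)  # detached copy of the tag and its children
    for s in clone.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        # Replace the whole span (and any text inside it) with the icon name
        s.replace_with(f" {{{ci.replace('icon-', '')}}} ")
    return clone.get_text(' ', strip=True)

sample = '<p>Para 2 text... <span class="icon icon-fire">span text</span> more Para 2 text.</p>'
soup = BeautifulSoup(sample, 'html.parser')
before = str(soup)  # snapshot to confirm the original tree is not modified
result = text_with_icons(soup.p)
print(result)
```

Swapping `replace_with` for `append` inside the loop would keep the span's own text, as in the append example earlier.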