Parsing inline elements of a paragraph with BeautifulSoup & Python

Time:12-13

I have some text with inline span elements (icons):

<p>
  <span class="icon icon-water"></span>
  Some text...
  <span class="icon icon-steel"></span>
  some more text.
</p>

I need to write a function that returns the text of this paragraph, with the icons converted to text:

<span class="icon icon-water"></span> => '{water}'
<span class="icon icon-steel"></span> => '{steel}'

The resulting string should look like:

{water} Some text... {steel} some more text.

How can I do this with python/bs4?

CodePudding user response:

It looks like what I was looking for is .contents, which outputs (surrounding whitespace trimmed):

[<span class="icon icon-water"></span>,
 'Some text...',
 <span class="icon icon-steel"></span>,
 'some more text.']

That gets me in the ballpark of what I'm looking for. Now I can just loop over the elements and apply my desired transformation.
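A minimal sketch of that loop, assuming the icons carry classes of the form "icon icon-water" (as the expected output suggests). The function name paragraph_text is made up for illustration, and html.parser is used only because it ships with Python:

```python
from bs4 import BeautifulSoup, NavigableString

def paragraph_text(p):
    """Return the paragraph's text with each icon span rendered as '{name}'."""
    parts = []
    for node in p.contents:
        if isinstance(node, NavigableString):
            # Plain text between the tags
            parts.append(str(node))
        elif node.name == 'span':
            # Use the first class starting with 'icon-' as the icon name
            names = [c[len('icon-'):] for c in node.get('class', []) if c.startswith('icon-')]
            parts.append('{' + names[0] + '}' if names else node.get_text())
        else:
            # Any other inline tag: fall back to its text
            parts.append(node.get_text())
    # Collapse the newlines and indentation that .contents preserves
    return ' '.join(' '.join(parts).split())

html = '''<p>
  <span class="icon icon-water"></span>
  Some text...
  <span class="icon icon-steel"></span>
  some more text.
</p>'''
p = BeautifulSoup(html, 'html.parser').p
print(paragraph_text(p))  # {water} Some text... {steel} some more text.
```

Because it only reads .contents, this leaves the parsed document unchanged, but it only looks at direct children of the paragraph.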

CodePudding user response:

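There are several ways to accomplish this. I'll use the following sample HTML and soup to demonstrate each method (re-create soup before trying each one, since the first two approaches modify it):

from bs4 import BeautifulSoup

sampleHtml = '''
<p>
  <span class="icon icon-water"></span>
  Some text...
  <span class="icon icon-steel"></span>
  some more text.
</p>
<p>
  Para 2 text...
  <span class="icon icon-fire">span text</span>
  more Para 2 text.
</p>
'''
# 'lxml' (needs the lxml package) adds the <html><body> wrapper seen in the outputs below;
# 'html.parser' also works but leaves the wrapper out
soup = BeautifulSoup(sampleHtml, 'lxml')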

If you're sure the span tags will be empty, or if you want to overwrite the text inside them, you can set the .string property to replace their contents, or use the .replace_with method to replace the span tags entirely:

for p in soup.find_all('p'):
    for s in p.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        s.string = f" {{{ci.replace('icon-', '')}}} " ## STRING METHOD 
        # s.replace_with(f" {{{ci.replace('icon-', '')}}} ") ## REPLACE METHOD 
    print(p.get_text(' ', strip=True))

print('\n\n########################### NEW SOUP ###########################\n')
print(soup)

printed output with the first method (setting .string):

{water} Some text... {steel} some more text.
Para 2 text... {fire} more Para 2 text.


########################### NEW SOUP ###########################

<html><body><p>
<span class="icon icon-water"> {water} </span>
  Some text...
  <span class="icon icon-steel"> {steel} </span>
  some more text.
</p>
<p>
  Para 2 text...
  <span class="icon icon-fire"> {fire} </span>
  more Para 2 text.
</p>
</body></html>

printed output with the replace method:

{water} Some text... {steel} some more text.
Para 2 text... {fire} more Para 2 text.


########################### NEW SOUP ###########################

<html><body><p>
 {water} 
  Some text...
   {steel} 
  some more text.
</p>
<p>
  Para 2 text...
   {fire} 
  more Para 2 text.
</p>
</body></html>

If the span tags might contain text and you want to preserve that text, you can use one of the .insert methods or the .append method:

for p in soup.find_all('p'):
    for s in p.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        # s.insert_before(f" {{{ci.replace('icon-', '')}}} ")
        # s.insert_after(f" {{{ci.replace('icon-', '')}}} ")
        # s.insert(0, f" {{{ci.replace('icon-', '')}}} ")
        s.append(f" {{{ci.replace('icon-', '')}}} ")
    print(p.get_text(' ', strip=True))

print('\n\n########################### NEW SOUP ###########################\n')
print(soup)

printed output (with the append method):

{water} Some text... {steel} some more text.
Para 2 text... span text {fire} more Para 2 text.


########################### NEW SOUP ###########################

<html><body><p>
<span class="icon icon-water"> {water} </span>
  Some text...
  <span class="icon icon-steel"> {steel} </span>
  some more text.
</p>
<p>
  Para 2 text...
  <span class="icon icon-fire">span text {fire} </span>
  more Para 2 text.
</p>
</body></html>

If you want to avoid altering soup, you could define a function like this:

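from bs4 import NavigableString

def icon_or_str(pdesc):
    # text nodes are returned as plain strings (this also matches subclasses such as Comment)
    if isinstance(pdesc, NavigableString): return str(pdesc)
    # any tag other than span contributes nothing itself
    if pdesc.name != 'span': return ''
    pdc = pdesc.get('class', [])
    pci = [c for c in pdc if 'icon-' in c]
    # only spans with both 'icon' and an 'icon-...' class count as icons
    if not (pci and 'icon' in pdc): return ''
    return f" {{{pci[0].replace('icon-', '')}}} "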

and then use it as below:

for p in soup.find_all('p'):
    pTxt = ' '.join(icon_or_str(d) for d in p.descendants)
    # split() with no arguments discards all runs of whitespace,
    # so re-joining the "words" removes the excess whitespace
    print(' '.join(pTxt.split()))

printed output:

{water} Some text... {steel} some more text.
Para 2 text... {fire} span text more Para 2 text.

(Personally, I feel that this last method is a bit cumbersome compared to the others.)
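If the goal is only to leave soup untouched, another option is to run the replace method on a copy of each paragraph. This is a rough sketch, assuming a bs4 version in which copy.copy on a tag returns a detached copy that includes its children; the name text_with_icons and the one-line sample are made up for illustration:

```python
import copy
from bs4 import BeautifulSoup

def text_with_icons(p):
    """Return p's text with icon spans shown as '{name}', leaving p untouched."""
    clone = copy.copy(p)  # detached copy of the tag and its children
    for s in clone.select('span.icon[class*="icon-"]'):
        ci = [c for c in s.get('class') if 'icon-' in c][0]
        # Replace the whole span (including any text inside it) in the copy only
        s.replace_with(f" {{{ci.replace('icon-', '')}}} ")
    return clone.get_text(' ', strip=True)

soup = BeautifulSoup(
    '<p><span class="icon icon-fire">span text</span> more text.</p>',
    'html.parser')
before = str(soup)
print(text_with_icons(soup.p))  # {fire} more text.
print(str(soup) == before)      # True: the original tree is unchanged
```

Like the replace method above, this drops any text inside the icon spans; swap replace_with for append or insert_before on the copy to keep it.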
