Exclusion of span with BS4 - Python-CodePudding

So I'm trying to exclude (not extract) the info contained in a span. Here's the HTML:

<li><span>Type:</span> Cardiac Ultrasound</li>

And here's my code:

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
        description_elements = description_el.find('span')
        for el in description_elements: 
            curr_el = {}
            key = el.replace(':', '')
            print(el)
            print(description_el.text.replace(' ', ''))

Here listing_soup is basically the whole page (in my example, the HTML above). When I run that, I get:

Type:
Type: CardiacUltrasound

As you can see, for some extraordinary reason :P, the span isn't affected by my replace() call, even though .text yields a str.

EDIT: Sorry. My objective is to create a bunch of dictionaries where the key is the span text and the value is what comes after it.

CodePudding user response:

NOTE: Be careful about "creating a bunch of dictionaries", as a dictionary can't have duplicate keys. You could instead keep a list of dictionaries, in which case duplicates across dictionaries won't matter (keys still have to be unique within each individual dictionary).
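
To make that note concrete, here is a minimal sketch in plain Python. The listings data is made up: it stands in for key/value text already pulled out of the <li> elements, and the second "Type" entry is an assumption added only to show the collision.

```python
# Hypothetical (key, value) pairs, as if already pulled out of two listings.
# The second "Type" entry is invented here to show the collision.
listings = [
    [("Type", "Cardiac Ultrasound")],
    [("Type", "Portable Ultrasound")],
]

# One shared dict: the second "Type" silently overwrites the first.
merged = {}
for pairs in listings:
    for key, value in pairs:
        merged[key] = value
print(merged)  # {'Type': 'Portable Ultrasound'}

# A list of dicts, one per listing: both values survive.
per_listing = [dict(pairs) for pairs in listings]
print(per_listing)
# [{'Type': 'Cardiac Ultrasound'}, {'Type': 'Portable Ultrasound'}]
```

Which shape fits depends on whether each listing should stay a separate record; if it should, the list of dicts is probably the safer choice.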

Option 1:

Use .next_sibling (it's an attribute, not a method)

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':', '')
    v = description_el.find('span').next_sibling.strip()
    
    print(k)
    print(v)

Option 2:

Just get the text from description_el, then .split(':', 1). That gives you the 2 elements you want (if I'm reading your question correctly).

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    descText = description_el.text.split(':', 1)
    k = descText[0].strip()
    v = descText[-1].strip()
    
    print(k)
    print(v)

Option 3:

Get the <span> text, remove the <span> from the tree, then get the remaining text in the <li>. Since you said you don't want to extract, though, this might not be useful to you.

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':','')
    description_el.find('span').extract()
    v = description_el.text.strip()
    
    print(k)
    print(v)

Output:

Type
Cardiac Ultrasound
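
As for why replace() looked like it did nothing in the question's code: str.replace() returns a new string and leaves the original alone, and the loop prints el rather than key. A small sketch, with a plain string standing in for the span's text (the real el is a NavigableString, which is a str subclass and behaves the same way here):

```python
el = "Type:"                # stand-in for the text inside <span>
key = el.replace(":", "")   # replace() builds a new string

print(el)   # Type:  (unchanged, which is what the question printed)
print(key)  # Type   (the cleaned key, which was never printed)
```

The second print in the question only strips spaces from the whole <li> text, so the span's "Type:" is still part of it.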

CodePudding user response:

To extract the text of a tag excluding the content of its child tags, you can use the method from this answer. Generally, you just need to iterate over the <li> tags and get the text from the ones that contain a child <span>.

Code:

from bs4 import BeautifulSoup, NavigableString

html = """<html><body>
<li><span>Key1:</span> Value1</li>
<li><span>Key2:</span> Value2</li>
<li>NoKeyValue</li>
<li><span>Key3:</span> Value3</li>
<li><span>Key4:</span> Value4</li>
</body></html>"""

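result = {}
for li in BeautifulSoup(html, "html.parser").find_all("li"):
    span = li.find("span")
    if span:  # skip <li> elements that have no <span> key
        # key: the span text without surrounding spaces and colons
        # value: only the direct text children of <li>, not the span's text
        result[span.text.strip(" :")] = \
            "".join(e for e in li if isinstance(e, NavigableString)).strip()

print(result)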
