So I'm trying to exclude (not extract) the info contained in a span. Here's the HTML:
<li><span>Type:</span> Cardiac Ultrasound</li>
And here's my code:
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    description_elements = description_el.find('span')
    for el in description_elements:
        curr_el = {}
        key = el.replace(':', '')
        print(el)
    print(description_el.text.replace(' ', ''))
Where listing_soup is basically the whole page (in my example, the HTML above). When I do that I get:
Type:
Type: CardiacUltrasound
As you can see, for some extraordinary reason :P, the span isn't affected by my replace() method, even though .text yields a str.
EDIT: Sorry. My objective is to create a bunch of dictionaries where the key is the span and the value is what comes after it.
CodePudding user response:
NOTE: Be careful about "creating a bunch of dictionaries", as a dictionary can't have duplicate keys. You could use a list of dictionaries instead, in which case a key repeated across dictionaries won't matter (it still matters within each individual dictionary).
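To make the duplicate-key point concrete, here is a small sketch. The pairs are made up for illustration ("Vascular Ultrasound" is a hypothetical second value, not from the question's page).

```python
# Hypothetical (key, value) pairs, as might be scraped from two listings.
pairs = [("Type", "Cardiac Ultrasound"), ("Type", "Vascular Ultrasound")]

# One dictionary: the second "Type" overwrites the first.
single = {}
for key, value in pairs:
    single[key] = value

# A list of dictionaries: every pair is kept.
many = [{key: value} for key, value in pairs]

print(single)  # {'Type': 'Vascular Ultrasound'}
print(many)    # [{'Type': 'Cardiac Ultrasound'}, {'Type': 'Vascular Ultrasound'}]
```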
Option 1:
Use .next_sibling (it's an attribute, not a method)
from bs4 import BeautifulSoup
html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''
listing_soup = BeautifulSoup(html, 'html.parser')
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    # key: the span's text without the colon
    k = description_el.find('span').text.replace(':', '')
    # value: the text node that directly follows the span
    v = description_el.find('span').next_sibling.strip()
    print(k)
    print(v)
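Option 1 assumes every <li> has a <span> that is directly followed by text. If that may not hold on the real page, a slightly more defensive sketch could look like the following. The second <li> and the `description` dictionary name are assumptions added for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# The second <li> is a made-up item without a <span>.
html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li>
<li>No span here</li>
</div>'''

listing_soup = BeautifulSoup(html, 'html.parser')
description = {}
for li in listing_soup.find(class_='item_description').find_all('li'):
    span = li.find('span')
    if span is None:
        continue  # no key for this item, so skip it
    sibling = span.next_sibling
    # next_sibling can be None or another tag; only accept a text node
    if isinstance(sibling, NavigableString):
        description[span.text.replace(':', '').strip()] = sibling.strip()

print(description)  # {'Type': 'Cardiac Ultrasound'}
```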
Option 2:
Just get the text from description_el, then .split(':', 1). Then you have the 2 elements you want (if I'm reading your question correctly).
from bs4 import BeautifulSoup
html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''
listing_soup = BeautifulSoup(html, 'html.parser')
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    # split on the first colon only, so colons in the value are kept
    descText = description_el.text.split(':', 1)
    k = descText[0].strip()
    v = descText[-1].strip()
    print(k)
    print(v)
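The second argument to split() matters when the value itself can contain a colon. A quick illustration on a plain string (the "Time: 10:30" text is a hypothetical example, not from the question's page):

```python
text = "Time: 10:30"  # hypothetical <li> text whose value contains a colon

# Without a limit, the value is cut at its own colon as well.
all_parts = text.split(":")

# With maxsplit=1, only the first colon separates key from value.
k, v = (part.strip() for part in text.split(":", 1))

print(all_parts)  # ['Time', ' 10', '30']
print(k)          # Time
print(v)          # 10:30
```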
Option 3:
Get the <span> text. Remove it. Then get the remaining text in the <li>. Although, since you don't want to extract, it might not be useful to you.
from bs4 import BeautifulSoup
html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''
listing_soup = BeautifulSoup(html, 'html.parser')
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    k = description_el.find('span').text.replace(':','')
    # extract() removes the span from the tree, so the soup itself is modified
    description_el.find('span').extract()
    v = description_el.text.strip()
    print(k)
    print(v)
Output:
Type
Cardiac Ultrasound
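As for why replace() seemed to have no effect in the question's loop: Python strings are immutable, so replace() returns a new string and leaves the original alone. The loop stores the result in key but then prints el, and the second print uses description_el.text, which still includes the span's text. A minimal illustration, with the literal "Type:" standing in for the span's text:

```python
el = "Type:"               # stands in for the text inside the <span>
key = el.replace(":", "")  # replace() returns a new string

print(el)   # Type:  (the original is unchanged)
print(key)  # Type
```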
CodePudding user response:
To extract the text of a tag excluding the content of its child tags, you can use the method from this answer. Generally you just need to iterate over the <li> tags and get the text from those that contain a child <span>.
Code:
from bs4 import BeautifulSoup, NavigableString
html = """<html><body>
<li><span>Key1:</span> Value1</li>
<li><span>Key2:</span> Value2</li>
<li>NoKeyValue</li>
<li><span>Key3:</span> Value3</li>
<li><span>Key4:</span> Value4</li>
</body></html>"""
result = {}
for li in BeautifulSoup(html, "html.parser").find_all("li"):
    span = li.find("span")
    if span:
        # key: span text without the colon; value: only the <li>'s own text nodes
        result[span.text.strip(" :")] = \
            "".join(e for e in li if isinstance(e, NavigableString)).strip()
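The same idea can be wrapped in a small helper to show how it differs from get_text(). This is only a sketch, and the function name `own_text` is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

def own_text(tag):
    """Return the text directly inside tag, ignoring text inside child tags."""
    return "".join(e for e in tag.children if isinstance(e, NavigableString)).strip()

li = BeautifulSoup("<li><span>Key1:</span> Value1</li>", "html.parser").li

print(own_text(li))   # Value1
print(li.get_text())  # Key1: Value1
```

Unlike extract(), this leaves the parsed tree untouched, which fits the "exclude, not extract" requirement from the question.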