Hi, I am trying to get the number 377 that comes after "Sold", where it is preceded by < !-- -- >. How do I do so? The following code returns 2 items. (I added spaces inside the marker so that it is visible.)
sold = soup.find_all('span', {"class":"jsx-302 jsx-385"})
Result:
<span class="jsx-302 jsx-385"><span class="jsx-302 jsx-385 sold-text">Sold</span> < !-- -- >377</span>,
<span class="jsx-302 jsx-385">Rp41,400 / 100 g</span>
I could apply a regex to items[0].text to keep only the first item, the one containing "Sold", and ignore the rest. However, is there a way to handle a span that contains the bracketed < !-- -- > marker?
CodePudding user response:
I agree that split() would work, but the HTML does not look valid, so it is not clear whether the marker is the literal text < !-- -- > or a real comment <!-- -->.
In the case of < !-- -- > (literal text), split on '>':
soup.select_one('span:has(.sold-text)').text.split('>')[-1]
In the case of <!-- --> (a real comment, which is left out of .text), split on the space:
soup.select_one('span:has(.sold-text)').text.split(' ')[-1]
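To see why the two cases call for different split arguments, here is a minimal sketch that prints the text of the outer span for both variants. The class attributes are reconstructed from the question, and outer_text is a hypothetical helper name used only for this illustration.

```python
from bs4 import BeautifulSoup

# Variant 1: the marker is literal text ("< !-- -- >"), as pasted in the question.
literal = ('<span class="jsx-302 jsx-385">'
           '<span class="jsx-302 jsx-385 sold-text">Sold</span> < !-- -- >377</span>')
# Variant 2: the marker is a genuine HTML comment ("<!-- -->").
comment = ('<span class="jsx-302 jsx-385">'
           '<span class="jsx-302 jsx-385 sold-text">Sold</span> <!-- -->377</span>')

def outer_text(html):
    """Return the text of the span that contains the .sold-text span."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one("span:has(.sold-text)").text

literal_text = outer_text(literal)
comment_text = outer_text(comment)

# Show the raw strings so the difference between the variants is visible.
print(repr(literal_text))
print(repr(comment_text))

# Each variant needs its own separator to isolate the number.
print(literal_text.split(">")[-1])
print(comment_text.split(" ")[-1])
```

With html.parser the literal marker should stay in the text, while a real comment is dropped from .text, which is why the separator differs.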
I would recommend filtering for digits instead, which works in both cases:
''.join(filter(str.isdigit, soup.select_one('span:has(.sold-text)').text))
Example
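from bs4 import BeautifulSoup
# Class attributes restored so that the .sold-text selector can match
html = '''
<span class="jsx-302 jsx-385"><span class="jsx-302 jsx-385 sold-text">Sold</span> < !-- -- >377</span>
<span class="jsx-302 jsx-385">Rp41,400 / 100 g</span>
'''
soup = BeautifulSoup(html, 'html.parser')
# Keep only the digit characters from the text of the span that contains .sold-text
sold = ''.join(filter(str.isdigit, soup.select_one('span:has(.sold-text)').text))
print(sold)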
Output
377
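If the page really contains a comment (<!-- -->), another option is to locate the comment node itself and read the text node that follows it. This is only a sketch under that assumption: the markup below is a reconstruction, and it does not apply when the marker is literal text.

```python
from bs4 import BeautifulSoup, Comment

# Assumed markup: a genuine HTML comment sits between "Sold" and the number.
html = ('<span class="jsx-302 jsx-385">'
        '<span class="jsx-302 jsx-385 sold-text">Sold</span> <!-- -->377</span>')

soup = BeautifulSoup(html, "html.parser")
outer = soup.select_one("span:has(.sold-text)")

# Find the first comment node inside the outer span, if there is one.
marker = outer.find(string=lambda s: isinstance(s, Comment))

# The node right after the comment is expected to be the text holding the count.
sold_count = None
if marker is not None and marker.next_sibling is not None:
    sold_count = str(marker.next_sibling).strip()

print(sold_count)
```

The None checks keep the snippet from raising when no comment is found, for example when the marker turns out to be plain text.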
CodePudding user response:
You can get the value 377 easily using the split() method as follows:
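from bs4 import BeautifulSoup
# Class attributes and closing tags restored so that find_all can match
doc = '''
<span class="jsx-302 jsx-385"><span class="jsx-302 jsx-385 sold-text">Sold</span> < !-- -- >377</span>
'''
soup = BeautifulSoup(doc, 'html.parser')
# A class string with a space matches the exact attribute value, so only the outer span is found
for sold in soup.find_all('span', {"class": "jsx-302 jsx-385"}):
    sold = sold.text
    # Everything after the last '>' of the literal marker is the number
    sold = sold.split('>')[-1]
    print(sold)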
Output:
377
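Since the question mentions a regex, here is a hedged sketch that wraps the idea in a small function and should work for both the literal marker and a real comment. extract_sold_count is a hypothetical name, and the sketch assumes the count is a plain run of digits at the end of the enclosing span's text.

```python
import re
from bs4 import BeautifulSoup

def extract_sold_count(html):
    """Return the sold count as an int, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.select_one(".sold-text")
    if label is None:
        return None
    # Text of the enclosing span, e.g. "Sold < !-- -- >377" or "Sold 377".
    text = label.parent.get_text()
    # Assumption: the count is the trailing run of digits in that text.
    match = re.search(r"(\d+)\s*$", text)
    return int(match.group(1)) if match else None

doc = ('<span class="jsx-302 jsx-385">'
       '<span class="jsx-302 jsx-385 sold-text">Sold</span> < !-- -- >377</span>')
print(extract_sold_count(doc))
```

Anchoring the pattern at the end of the text avoids picking up digits that appear elsewhere, which the plain digit filter would concatenate.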