I am working on a web scraping project and am having difficulty getting the text of an element that has a green progress bar. I have attached the HTML for reference. I need the progress bar title only if the adjacent progress bar's color is green.
<div >
<span >
ATTLIST
</span>
<div >
<div data-percent="]\\\\\\\\\\\\\\\\\\\\\\'kj;hb p7-yh " style="background-color: green; width: 100%;">
</div>
</div>
<span >
None of the mentioned
</span>
<div >
<div data-percent="100" style="width: 100%;">
</div>
</div>
<span >
XML
</span>
<div >
<div data-percent="100" style="width: 100%;">
</div>
</div>
<span >
SGML
</span>
<div >
<div data-percent="100" style="width: 100%;">
</div>
</div>
</div>
CodePudding user response:
There are a couple of ways to accomplish what you want. I personally prefer using lxml to parse the HTML, then using XPath to find the "green" div and locate the preceding span with the title:
import lxml.html as lh
prog = """[your html sample above]"""
doc = lh.fromstring(prog)
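# Find the wrapper div whose child bar has "green" in its style, then take the nearest preceding span
print(doc.xpath('//div[./div[contains(@style,"green")]]/preceding-sibling::span[1]/text()')[0].strip())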
Output:
ATTLIST
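If more than one bar can be green, the same XPath idea can be extended to collect every matching title. The following is only a sketch: the helper name green_titles and the trimmed-down SAMPLE markup are made up for illustration, and it assumes the color is always set through an inline style attribute containing the word "green" (so a value such as "lightgreen" would match too).

```python
import lxml.html as lh

# Hypothetical sample, reduced from the question's markup; two bars are green here.
SAMPLE = """
<div>
  <span>ATTLIST</span>
  <div><div data-percent="100" style="background-color: green; width: 100%;"></div></div>
  <span>XML</span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
  <span>SGML</span>
  <div><div data-percent="100" style="background-color: green; width: 100%;"></div></div>
</div>
"""

def green_titles(html):
    """Return the title of every progress bar whose inline style mentions green."""
    doc = lh.fromstring(html)
    # Wrapper divs that directly contain a bar div styled green.
    wrappers = doc.xpath('//div[./div[contains(@style, "green")]]')
    titles = []
    for wrapper in wrappers:
        # On a reverse axis, [1] selects the nearest preceding sibling span.
        spans = wrapper.xpath('preceding-sibling::span[1]')
        if spans:
            titles.append(spans[0].text_content().strip())
    return titles

print(green_titles(SAMPLE))
```

Evaluating preceding-sibling::span[1] relative to each wrapper keeps every title paired with its own bar, instead of relying on the order of one combined result list.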
CodePudding user response:
The titles are the text nodes of the span elements with class progressbar-title. You can also pull them by calling the .find(text=True) or .get_text(strip=True) method, as follows:
html='''
<html>
<body>
<div >
<span class="progressbar-title">
ATTLIST
</span>
<div >
<div data-percent="]\\\\\\\\\\\'kj;hb p7-yh " style="background-color: green; width: 100%;">
</div>
</div>
<span class="progressbar-title">
None of the mentioned
</span>
<div >
<div data-percent="100" style="width: 100%;">
</div>
</div>
<span class="progressbar-title">
XML
</span>
<div >
<div data-percent="100" style="width: 100%;">
</div>
</div>
<span class="progressbar-title">
SGML
</span>
<div >
<div data-percent="100" style="width: 100%;">
</div>
</div>
</div>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
#print(soup.prettify())
for span in soup.find_all('span',class_="progressbar-title"):
    txt = span.find(text=True).strip()
    #OR
    #txt = span.get_text(strip=True)
    print(txt)
Output:
ATTLIST
None of the mentioned
XML
SGML
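The loop above prints every title. To keep only the titles whose adjacent bar is green, as the question asks, one possible approach is to look at the div that follows each span and check its inner bar's style. This is a sketch under assumptions: the helper name green_titles_bs4 and the compact SAMPLE markup are hypothetical, the span class progressbar-title is taken from the code above, and the color is assumed to appear as the word "green" in an inline style attribute. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the question's structure; only the first bar is green.
SAMPLE = """
<div>
  <span class="progressbar-title">ATTLIST</span>
  <div><div data-percent="100" style="background-color: green; width: 100%;"></div></div>
  <span class="progressbar-title">None of the mentioned</span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
  <span class="progressbar-title">XML</span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
  <span class="progressbar-title">SGML</span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
</div>
"""

def green_titles_bs4(html):
    """Return titles whose following progress bar has green in its inline style."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for span in soup.find_all("span", class_="progressbar-title"):
        # The wrapper div is the next element sibling of the title span.
        wrapper = span.find_next_sibling("div")
        if wrapper is None:
            continue
        # The bar is the inner div that carries a style attribute.
        bar = wrapper.find("div", style=True)
        if bar is not None and "green" in bar["style"]:
            titles.append(span.get_text(strip=True))
    return titles

print(green_titles_bs4(SAMPLE))
```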