How to extract the value of a class based on the color of an adjacent div?

I am working on a web scraping project and am having difficulty getting the text of an element that sits next to a green progress bar. I have attached the HTML for reference. I need the progress bar title only if the adjacent progress bar is green.

<div >
  <span >
                                                ATTLIST
                                            </span>
  <div >
    <div data-percent="]\\\\\\\\\\\\\\\\\\\\\\'kj;hb p7-yh "  style="background-color: green; width: 100%;">
    </div>
  </div>
  <span >
                                                None of the mentioned
                                            </span>
  <div >
    <div data-percent="100"  style="width: 100%;">
    </div>
  </div>
  <span >
                                                XML
                                            </span>
  <div >
    <div data-percent="100"  style="width: 100%;">
    </div>
  </div>
  <span >
                                                SGML
                                            </span>
  <div >
    <div data-percent="100"  style="width: 100%;">
    </div>
  </div>
</div>

CodePudding user response：

There are a couple of ways to accomplish what you want. I personally prefer using lxml to parse the HTML, then using XPath to find the wrapper of the "green" div and locate the span with the title that immediately precedes it:

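import lxml.html as lh

prog = """[your html sample above]"""
doc = lh.fromstring(prog)
# Match the wrapper div whose child div is styled green, then take the
# nearest preceding span sibling (span[1] on the reverse axis).
print(doc.xpath('//div[./div[contains(@style,"green")]]/preceding-sibling::span[1]/text()')[0].strip())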

Output:

ATTLIST
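
If more than one bar can be green, the same idea can be wrapped in a small helper that returns every matching title. This is only a sketch: the function name green_titles and the trimmed SAMPLE markup are made up for illustration, and it assumes each title span is a sibling that directly precedes its bar wrapper, as in the question's HTML.

```python
import lxml.html as lh

# Trimmed, hypothetical copy of the question's markup. Class attributes are
# left out because the XPath below does not rely on them.
SAMPLE = """
<div>
  <span> ATTLIST </span>
  <div><div data-percent="100" style="background-color: green; width: 100%;"></div></div>
  <span> None of the mentioned </span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
  <span> XML </span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
</div>
"""


def green_titles(markup):
    """Return the title text that precedes each green progress bar."""
    doc = lh.fromstring(markup)
    titles = []
    # Each match is a wrapper div that directly contains a green-styled div.
    for wrapper in doc.xpath('//div[./div[contains(@style, "green")]]'):
        # On a reverse axis, span[1] is the closest preceding span sibling.
        spans = wrapper.xpath('preceding-sibling::span[1]')
        if spans:
            titles.append(spans[0].text_content().strip())
    return titles


print(green_titles(SAMPLE))  # ['ATTLIST']
```

Matching on the substring "green" in the style attribute is a simplification; a page that sets the color through a CSS class or a hex value would need a different test.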

CodePudding user response：

The titles are the text nodes of the span elements with class progressbar-title. You can also pull them by calling the .find(text=True) or .get_text(strip=True) method, as follows:

html='''
<html>
 <body>
  <div >
   <span class="progressbar-title">
    ATTLIST
   </span>
   <div >
    <div  data-percent="]\\\\\\\\\\\'kj;hb p7-yh " style="background-color: green; width: 100%;">
    </div>
   </div>
   <span class="progressbar-title">
    None of the mentioned
   </span>
   <div >
    <div  data-percent="100" style="width: 100%;">
    </div>
   </div>
   <span class="progressbar-title">
    XML
   </span>
   <div >
    <div  data-percent="100" style="width: 100%;">
    </div>
   </div>
   <span class="progressbar-title">
    SGML
   </span>
   <div >
    <div  data-percent="100" style="width: 100%;">
    </div>
   </div>
  </div>
 </body>
</html>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')

#print(soup.prettify())

for span in soup.find_all('span',class_="progressbar-title"):
    txt = span.find(text=True).strip()
    #OR
    #txt = span.get_text(strip=True)
    print(txt)

Output:

ATTLIST
None of the mentioned
XML
SGML
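
The loop above prints every title. To keep only the title next to the green bar, as the question asks, one possible BeautifulSoup sketch is shown here. The helper name green_titles_bs4 and the trimmed SAMPLE markup are illustrative, and the code assumes the bar wrapper is the next div sibling of each progressbar-title span.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the markup; only the span class that the
# answer's code relies on is included.
SAMPLE = """
<div>
  <span class="progressbar-title"> ATTLIST </span>
  <div><div data-percent="100" style="background-color: green; width: 100%;"></div></div>
  <span class="progressbar-title"> None of the mentioned </span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
  <span class="progressbar-title"> XML </span>
  <div><div data-percent="100" style="width: 100%;"></div></div>
</div>
"""


def green_titles_bs4(markup):
    """Return titles whose following progress bar has a green style."""
    soup = BeautifulSoup(markup, "html.parser")
    titles = []
    for span in soup.find_all("span", class_="progressbar-title"):
        # Assumption: the bar wrapper is the next <div> sibling of the title.
        wrapper = span.find_next_sibling("div")
        if wrapper is None:
            continue
        # Look for an inner div that carries an inline style attribute.
        bar = wrapper.find("div", style=True)
        if bar is not None and "green" in bar["style"]:
            titles.append(span.get_text(strip=True))
    return titles


print(green_titles_bs4(SAMPLE))  # ['ATTLIST']
```

The sketch uses the built-in html.parser so that it does not depend on lxml; passing 'lxml' as in the answer above should behave the same for this markup.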