parse a website with beautiful soup - attempting to parse value unsuccessfully

Time:09-07

Hi everyone, I am parsing an HTML document with BeautifulSoup. However, there is one area of information I can't seem to parse:

the html:

<small>
<span >CVE-2019-11198</span>
<span >6.1 - Medium</span>
- August 05, 2019
</small>

I am parsing this whole block, but I want CVE-2019-11198, 6.1, Medium, and August 05, 2019 as separate values. Instead, I'm getting the whole block under <small> with the following code:

original:

cves=soup.find_all("div", class_="cve_listing")
for cve in cves:
    #CVE, vuln numeric rating, vuln sev cat, vuln date
    vulninfo=cve.find("small").text

updated:

cves=soup.find_all("div", class_="cve_listing")
for cve in cves:
    vulncve=cve.find("span", class_="label-primary")
    vulninfo=cve.select_one('span.label').parent
    vulninfores=[x.get_text(strip=True) for x in vulninfo.contents if len(x.text) > 1]

outputs:

AttributeError: 'NavigableString' object has no attribute 'text'

Any thoughts on how to parse this efficiently?

CodePudding user response:

You need to modify your question a bit.

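  1. You selected "div", class_="cve_listing" but didn't show the HTML for it.

  2. .contents returns bare text nodes (NavigableString) as well as tags, and as your error shows, those text nodes have no .text attribute in your setup, so the comprehension fails on them. Try the below code: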

Example:

cves = soup.find_all("div", class_="cve_listing")
for cve in cves:
    vulncve = cve.find("span", class_="label-primary")
    vulninfo = cve.select_one('span.label')
    # Search inside the current listing (cve), not the whole document (soup),
    # so each iteration reads its own <small> block.
    vulninfores = [x.get_text(strip=True) for x in cve.select("small")][-1]
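
To see why the comprehension in the question raises AttributeError, it can help to look at what .contents actually holds. This is only a sketch, assuming a cut-down version of the <small> block from the question; the html string and the variable names are illustrative and not taken from the real page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down, illustrative version of the <small> block from the question.
html = "<small><span>CVE-2019-11198</span> - August 05, 2019</small>"
small = BeautifulSoup(html, "html.parser").small

# .contents mixes element nodes (Tag) with bare text nodes (NavigableString).
kinds = [type(node).__name__ for node in small.contents]
print(kinds)

# Handle each kind explicitly: tags via get_text(), text nodes via str().
values = [
    node.get_text(strip=True) if isinstance(node, Tag) else str(node).strip()
    for node in small.contents
]
values = [v for v in values if v]  # drop whitespace-only nodes
print(values)
```

Checking the node type before calling a method avoids relying on text nodes supporting .text, which the error in the question suggests they did not in that environment.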
   

CodePudding user response:

Not having the url of the actual page means I cannot test it, but supposing the html is correct and you can reach it as stated in your question, this is one way of getting that info:

from bs4 import BeautifulSoup

html = '''
<small>
<span >CVE-2019-11198</span>
<span >6.1 - Medium</span>
- August 05, 2019
</small>
'''

soup = BeautifulSoup(html, 'html.parser')
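# The sample HTML above shows no class attributes on the spans,
# so select the <small> tag directly instead of going through span.label.
data = soup.select_one('small')
# stripped_strings yields each text fragment, whitespace-trimmed, skipping blank ones.
desired_result = list(data.stripped_strings)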
print(desired_result)

Result:

['CVE-2019-11198', '6.1 - Medium', '- August 05, 2019']
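
The question asks for the CVE, the numeric rating, the severity and the date as four separate values, so the three strings above still need one more split. A minimal sketch, assuming every listing follows the same "score - severity" and "- date" layout as the sample; split_vuln_info is a hypothetical helper name, not part of BeautifulSoup:

```python
def split_vuln_info(parts):
    """Split ['CVE-...', '6.1 - Medium', '- August 05, 2019'] into four values."""
    cve_id = parts[0]
    # '6.1 - Medium' -> '6.1' and 'Medium' (split on the first hyphen only)
    score_text, severity = [p.strip() for p in parts[1].split("-", 1)]
    # '- August 05, 2019' -> 'August 05, 2019' (drop the leading hyphen and spaces)
    date_text = parts[2].lstrip("- ").strip()
    return cve_id, float(score_text), severity, date_text


parts = ['CVE-2019-11198', '6.1 - Medium', '- August 05, 2019']
cve_id, score, severity, date_text = split_vuln_info(parts)
print(cve_id, score, severity, date_text)
```

If a listing is missing one of the pieces, the unpacking will raise an exception, so real pages may need a length check before calling the helper.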

BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
