Hi everyone, I am parsing an HTML document with BeautifulSoup. However, there is one block of information I can't seem to parse.
The HTML:
<small>
<span >CVE-2019-11198</span>
<span >6.1 - Medium</span>
- August 05, 2019
</small>
I am parsing this whole block, but I want CVE-2019-11198, 6.1, Medium, and August 05, 2019 as separate values. Instead I'm getting the whole block under <small> with the following code:
Original:
cves = soup.find_all("div", class_="cve_listing")
for cve in cves:
    # CVE, vuln numeric rating, vuln sev cat, vuln date
    vulninfo = cve.find("small").text
Updated:
cves = soup.find_all("div", class_="cve_listing")
for cve in cves:
    vulncve = cve.find("span", class_="label-primary")
    vulninfo = cve.select_one('span.label').parent
    vulninfores = [x.get_text(strip=True) for x in vulninfo.contents if len(x.text) > 1]
Output:
AttributeError: 'NavigableString' object has no attribute 'text'
Any thoughts on how to parse this efficiently?
CodePudding user response:
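You need to modify your question a bit. You selected "div", class_="cve_listing" but didn't show the HTML for that element. The error is raised because .contents returns bare NavigableString objects alongside the tags, and in your setup those strings have no .text attribute. Try the code below: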
Example:
cves = soup.find_all("div", class_="cve_listing")
for cve in cves:
    vulncve = cve.find("span", class_="label-primary")
    vulninfo = cve.select_one('span.label')
    vulninfores = [x.get_text(strip=True) for x in soup.select(".cve_listing small")][-1]
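To see why the AttributeError from the question appears, it can help to look at what .contents actually holds. The following is only a minimal sketch, assuming the <small> markup from the question (without the surrounding cve_listing div, which was not shown); the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup copied from the question; the real page may differ.
html = """
<small>
<span>CVE-2019-11198</span>
<span>6.1 - Medium</span>
- August 05, 2019
</small>
"""

soup = BeautifulSoup(html, "html.parser")
small = soup.find("small")

# .contents mixes Tag objects (the spans) with NavigableString objects (bare text).
tags = [node for node in small.contents if isinstance(node, Tag)]
strings = [str(node).strip() for node in small.contents if not isinstance(node, Tag)]
strings = [s for s in strings if s]  # drop whitespace-only strings

span_texts = [t.get_text(strip=True) for t in tags]
print(span_texts)
print(strings)
```

Checking the node type before calling tag-only attributes avoids the error without depending on which attributes a given bs4 version provides on bare strings.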
CodePudding user response:
Without the URL of the actual page I cannot test this, but assuming the HTML is correct and you can reach it as described in your question, this is one way of getting that info:
from bs4 import BeautifulSoup
html = '''
<small>
<span >CVE-2019-11198</span>
<span >6.1 - Medium</span>
- August 05, 2019
</small>
'''
soup = BeautifulSoup(html, 'html.parser')
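# The sample spans above carry no class attribute, so select the <small> element directly.
data = soup.select_one('small')
# stripped_strings yields every text node (bare strings included) and skips whitespace-only ones.
desired_result = list(data.stripped_strings)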
print(desired_result)
Result:
['CVE-2019-11198', '6.1 - Medium', '- August 05, 2019']
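The question asks for four separate values, while the result above has three strings. One possible way to split them further, sketched under the assumption that the rating always looks like "<score> - <severity>" and the date always starts with "- " (the helper name split_vuln_info is hypothetical):

```python
def split_vuln_info(parts):
    """Split [cve, 'score - severity', '- date'] into four separate values."""
    cve_id, rating, date_text = parts
    # Assumes exactly one " - " separates the numeric score from the severity.
    score_text, severity = [p.strip() for p in rating.split(" - ", 1)]
    # Remove the leading dash and spaces in front of the date.
    date_text = date_text.lstrip("- ").strip()
    return cve_id, float(score_text), severity, date_text


parts = ['CVE-2019-11198', '6.1 - Medium', '- August 05, 2019']
cve_id, score, severity, published = split_vuln_info(parts)
print(cve_id, score, severity, published)
```

The date is left as a string here; if a date object is needed, datetime.strptime with the format "%B %d, %Y" should parse it in an English locale.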
BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html