I currently have a dataframe I've produced through scraping https://www.cve.org/downloads.
  Format Unix Compressed (.Z)           Gzipped            Raw                                    Additional Notes
0    CSV        allitems.csv.Z   allitems.csv.gz   allitems.csv  NOTE: suitable for import into spreadsheet pro...
1   HTML       allitems.html.Z  allitems.html.gz  allitems.html                                                NaN
2   Text        allitems.txt.Z   allitems.txt.gz   allitems.txt                                                NaN
3    XML        allitems.xml.Z   allitems.xml.gz   allitems.xml                      XML Schema Design: cve_1.0.xsd
Under the Raw column, allitems.csv is actually a link on the website. Once the table is parsed into a dataframe, the href value of the link is no longer accessible. Below is the code I currently have, using Selenium and pandas:
import pandas as pd
from selenium import webdriver
Browser = webdriver.Safari()

# Navigate to the URL:
Browser.get("https://www.cve.org/downloads")

# Get the raw HTML string:
RawHtmlString = Browser.page_source
df = pd.read_html(RawHtmlString)[0]
print(df)
How do I edit my program so that it can extract the link and automatically download the file?
CodePudding user response:
Get links
If you really want to extract the links, you could first get all the a tags nested inside td elements with the attribute data-label="Raw", and then loop through them and get the hrefs. E.g.:
raw = Browser.find_elements("xpath", "//td[@data-label='Raw']/a")
links = [r.get_attribute('href') for r in raw]
print(links)
['https://cve.mitre.org/data/downloads/allitems.csv',
'https://cve.mitre.org/data/downloads/allitems.html',
'https://cve.mitre.org/data/downloads/allitems.txt',
'https://cve.mitre.org/data/downloads/allitems.xml']
But if you're only interested in the csv, you could use:
csvs = Browser.find_elements(
    "xpath", "//td[@data-label='Raw']/a[contains(@href,'.csv')]")
links = [csv.get_attribute('href') for csv in csvs]

# or just use `find_element`, seeing that there is only one such file:
csv_link = Browser.find_element(
    "xpath", "//td[@data-label='Raw']/a[contains(@href,'.csv')]")\
    .get_attribute('href')
Of course, in this particular case, these would be quite pointless exercises. As you can see above, all links actually have the same base url. So, you can also simply create an extra column or something:
BASE = 'https://cve.mitre.org/data/downloads/'
df['Urls'] = BASE + df.Raw
print(df.Urls)
0 https://cve.mitre.org/data/downloads/allitems.csv
1 https://cve.mitre.org/data/downloads/allitems....
2 https://cve.mitre.org/data/downloads/allitems.txt
3 https://cve.mitre.org/data/downloads/allitems.xml
Name: Urls, dtype: object
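As an aside, if the end goal is just to get the CSV into pandas, read_csv also accepts a URL directly, so you could skip the explicit download for a quick look. A minimal sketch, with the caveat that the skiprows and encoding values below are assumptions about the file's preamble and character set (the file is also fairly large):

import pandas as pd

# read_csv accepts a URL directly. allitems.csv starts with a few preamble
# lines before the real header row; skiprows=2 is an assumption about that
# preamble and may need adjusting.
cve_df = pd.read_csv(
    'https://cve.mitre.org/data/downloads/allitems.csv',
    skiprows=2,
    encoding='latin-1',   # assumption: the file may not be pure UTF-8
    on_bad_lines='skip',  # tolerate the occasional malformed row
)
print(cve_df.head())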
Download files
For downloading, I would rely on urllib.request. Note the warning in the docs, though: "[This function] might become deprecated at some point in the future". Might... that warning has been around for a while. Try something as follows:
from urllib import request
my_path = 'destination_folder_path/' # mind the "/" at the end!
for l in links:
    fname = l.rsplit('/', maxsplit=1)[1]
    print(l)  # just to see what we're downloading
    request.urlretrieve(l, f'{my_path}{fname}')
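If you'd rather avoid the semi-deprecated interface altogether, the same loop can be written with the requests library instead. A sketch, assuming requests is installed and reusing links and my_path from above:

import requests

for l in links:
    fname = l.rsplit('/', maxsplit=1)[1]
    # Stream the download so large files are written in chunks
    # instead of being held in memory all at once.
    with requests.get(l, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(f'{my_path}{fname}', 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                f.write(chunk)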
CodePudding user response:
First you have to access the href attribute of the a tag where the link is located, in order to get a relative path like "/data/downloads/file.csv.gz":
import urllib.parse
import requests

s = requests.Session()
link = '/data/downloads/file.csv.gz'
base_url = 'https://cve.mitre.org/'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: a generic UA header
Then you apply something like this:
resp = s.get(url=urllib.parse.urljoin(base_url, link), headers=headers)
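That call only fetches the response; to actually land the file on disk you could continue along these lines (deriving the output name from the path is an assumption):

resp.raise_for_status()

# Save the gzipped payload under its original file name,
# e.g. 'file.csv.gz' for the example path above.
with open(link.rsplit('/', 1)[-1], 'wb') as f:
    f.write(resp.content)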