I currently have a dataframe I've produced through scraping https://www.cve.org/downloads.
  Format Unix Compressed (.Z)           Gzipped            Raw                                    Additional Notes
0    CSV        allitems.csv.Z   allitems.csv.gz   allitems.csv  NOTE: suitable for import into spreadsheet pro...
1   HTML       allitems.html.Z  allitems.html.gz  allitems.html                                                NaN
2   Text        allitems.txt.Z   allitems.txt.gz   allitems.txt                                                NaN
3    XML        allitems.xml.Z   allitems.xml.gz   allitems.xml                      XML Schema Design: cve_1.0.xsd
Under the Raw column, allitems.csv is actually a link on the website. Once the table is parsed into a dataframe, the href value of the link is no longer accessible. Below is the code I currently have, using Selenium and pandas:
import pandas as pd
from selenium import webdriver
Browser = webdriver.Safari()

# Navigate to the URL:
Browser.get("https://www.cve.org/downloads")

# Get the raw HTML string:
RawHtmlString = Browser.page_source
df = pd.read_html(RawHtmlString)[0]
print(df)
How do I edit my program so that it can extract the link and automatically download the file?
CodePudding user response:
Get links
If you really want to extract the links, you could first get all the a tags nested inside td elements with the attribute data-label="Raw", and then loop through them and get the hrefs. E.g.:
raw = Browser.find_elements("xpath", "//td[@data-label='Raw']/a")
links = [r.get_attribute('href') for r in raw]
print(links)
['https://cve.mitre.org/data/downloads/allitems.csv',
'https://cve.mitre.org/data/downloads/allitems.html',
'https://cve.mitre.org/data/downloads/allitems.txt',
'https://cve.mitre.org/data/downloads/allitems.xml']
But if you're only interested in the csv, you could use:
csvs = Browser.find_elements(
    "xpath", "//td[@data-label='Raw']/a[contains(@href,'.csv')]")
links = [csv.get_attribute('href') for csv in csvs]

# or just use `find_element`, seeing that there is only one such file:
csv_link = Browser.find_element(
    "xpath", "//td[@data-label='Raw']/a[contains(@href,'.csv')]")\
    .get_attribute('href')
Of course, in this particular case, these would be quite pointless exercises. As you can see above, all links actually have the same base url. So, you can also simply create an extra column or something:
BASE = 'https://cve.mitre.org/data/downloads/'
df['Urls'] = BASE + df.Raw
print(df.Urls)
0 https://cve.mitre.org/data/downloads/allitems.csv
1 https://cve.mitre.org/data/downloads/allitems....
2 https://cve.mitre.org/data/downloads/allitems.txt
3 https://cve.mitre.org/data/downloads/allitems.xml
Name: Urls, dtype: object
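As an aside, if the end goal is just to get the CSV into pandas, read_csv also accepts a URL directly, so you could skip the explicit download for a quick look. A minimal sketch, with the caveat that the skiprows and encoding values below are assumptions about the file's preamble and character set (the file is also fairly large):

import pandas as pd

# read_csv accepts a URL directly. allitems.csv starts with a few preamble
# lines before the real header row; skiprows=2 is an assumption about that
# preamble and may need adjusting.
cve_df = pd.read_csv(
    'https://cve.mitre.org/data/downloads/allitems.csv',
    skiprows=2,
    encoding='latin-1',   # assumption: the file may not be pure UTF-8
    on_bad_lines='skip',  # tolerate the occasional malformed row
)
print(cve_df.head())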
Download files
For downloading, I would rely on urllib.request. Note the warning in the docs, though: "[This function] might become deprecated at some point in the future". Might... that warning has been around for a while. Try something as follows:
from urllib import request
my_path = 'destination_folder_path/' # mind the "/" at the end!
for l in links:
    fname = l.rsplit('/', maxsplit=1)[1]
    print(l)  # just to see what we're downloading
    request.urlretrieve(l, f'{my_path}{fname}')
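If you'd rather avoid the semi-deprecated interface altogether, the same loop can be written with the requests library instead. A sketch, assuming requests is installed and reusing links and my_path from above:

import requests

for l in links:
    fname = l.rsplit('/', maxsplit=1)[1]
    # Stream the download so large files are written in chunks
    # instead of being held in memory all at once.
    with requests.get(l, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(f'{my_path}{fname}', 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                f.write(chunk)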
CodePudding user response:
First you have to access the href attribute of the a tag where the link is located, in order to get a relative path like "/data/downloads/file.csv.gz":
import urllib.parse
import requests

s = requests.Session()
link = '/data/downloads/file.csv.gz'
base_url = 'https://cve.mitre.org/'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: a generic UA header
Then you apply something like this:
resp = s.get(url=urllib.parse.urljoin(base_url, link), headers=headers)
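That call only fetches the response; to actually land the file on disk you could continue along these lines (deriving the output name from the path is an assumption):

resp.raise_for_status()

# Save the gzipped payload under its original file name,
# e.g. 'file.csv.gz' for the example path above.
with open(link.rsplit('/', 1)[-1], 'wb') as f:
    f.write(resp.content)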