Web scrape and download Excel files with Python


I've been trying to scrape a website for its Excel files. I plan to do this once, to get the bulk of the data in its data archives section. I've been able to download individual files one at a time with urllib requests, and I tested that manually on several different files. But when I try to write a function that downloads all of them, I get errors. The first error occurred just while collecting the HTTP file addresses into a list. I set verify to False (not best practice, for security reasons) to work around the SSL certificate error, and that worked. I then went a step further and tried scraping the files and downloading them to a specific folder. I've done this before on a similar project and didn't have nearly this much trouble with SSL certificate errors.

import requests
from bs4 import BeautifulSoup
import os

os.chdir(r'C:\ The output path where it will go\\')
url = 'https://pages.stern.nyu.edu/~adamodar/pc/archives/'
reqs = requests.get(url, verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')
file_type = '.xls'
urls = []
for link in soup.find_all('a'):
    file_link = link.get('href')
    if file_type in file_link:
        with open(link.text, 'wb') as file:
            response = requests.get(url + file_link)

This is the error it has been giving me, even with verify set to False, which seemed to solve the problem earlier when I was generating the list. Each time I run it, it grabs the first file but doesn't loop to the next.

requests.exceptions.SSLError: HTTPSConnectionPool(host='pages.stern.nyu.edu', port=443): Max retries exceeded with url: /~adamodar/pc/archives/BAMLEMPBPUBSICRPIEY19.xls (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')

What am I missing? I thought I fixed the verification issue.

CodePudding user response:

You forgot to set verify=False on the requests.get call that downloads each file:

urls = []
for link in soup.find_all('a'):
    file_link = link.get('href')
    if file_type in file_link:
        with open(link.text, 'wb') as file:
            response = requests.get(url + file_link, verify=False)  # <-- verify=False was missing here
            file.write(response.content)  # save the downloaded bytes to the open file
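
One way to avoid forgetting the flag on a second call is to take it as a single parameter and pass it to every request. The following is only a minimal sketch of that idea, not the asker's actual code. The names find_file_links, download_files and out_dir are hypothetical, and it assumes the links on the archive page are relative hrefs, as in the question. It also skips anchors that have no href, which would otherwise raise a TypeError in the `in` check.

```python
import os
from urllib.parse import urljoin

import requests
import urllib3
from bs4 import BeautifulSoup


def find_file_links(html, base_url, file_type=".xls"):
    """Return absolute URLs for every <a href> whose href contains file_type."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a"):
        href = anchor.get("href")
        # Anchors without an href give None, so skip them before the substring check.
        if href and file_type in href:
            links.append(urljoin(base_url, href))
    return links


def download_files(base_url, out_dir, file_type=".xls", verify=False):
    """Download every matching file linked from base_url into out_dir.

    verify is forwarded to every request: False disables certificate
    checks (insecure), or it can be a path to a CA bundle file.
    """
    if verify is False:
        # Silence the InsecureRequestWarning that urllib3 emits per request.
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with requests.Session() as session:
        page = session.get(base_url, verify=verify)
        page.raise_for_status()
        for link in find_file_links(page.text, base_url, file_type):
            response = session.get(link, verify=verify)
            response.raise_for_status()
            # Name the local file after the last path segment of the URL.
            target = os.path.join(out_dir, os.path.basename(link))
            with open(target, "wb") as fh:
                fh.write(response.content)
            saved.append(target)
    return saved
```

Calling download_files('https://pages.stern.nyu.edu/~adamodar/pc/archives/', 'archives') would then use the same verify setting for the index page and for each file. If a CA bundle containing the missing issuer certificate is available, passing its path as verify is a safer alternative to disabling verification.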