I'm trying to read some data from the web, but I'm running into an unexpected problem. I call it unexpected because the URL I'm trying to read exists and opens without problems. However, when I run the code below, I get the error "HTTP Error 404: Not Found", even though the URL exists (see here). Does anyone know what I'm doing wrong? Thanks!
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur
index = 'MSFT'
url_is = 'https://finance.yahoo.com/quote/' + index + '/financials?p=' + index
# Read data
read_data = ur.urlopen(url_is).read()
CodePudding user response:
Some sites require a valid "User-Agent" header. By default, urllib identifies itself with a User-Agent of the form "Python-urllib/3.x", which some sites reject. Because the url parameter of urlopen can also be a Request object, you can set the headers on the Request along with the URL, as below:
from urllib.request import Request, urlopen
index = 'MSFT'
url_is = 'https://finance.yahoo.com/quote/' + index + '/financials?p=' + index
req = Request(url_is, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
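If you want to check what will be sent before hitting the network, or see the status code instead of an exception, something like the following might help. It is only a sketch: the helper names build_request and fetch are made up for illustration, and the URL pattern is simply the one from the question.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(symbol, user_agent="Mozilla/5.0"):
    """Build a Request for the financials page with an explicit User-Agent.

    Hypothetical helper; the URL pattern is copied from the question.
    """
    url = f"https://finance.yahoo.com/quote/{symbol}/financials?p={symbol}"
    return Request(url, headers={"User-Agent": user_agent})


def fetch(symbol):
    """Return (status_code, body_bytes); the body is empty on an HTTP error."""
    req = build_request(symbol)
    try:
        with urlopen(req) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, b""


# Inspect the request without opening a connection.
req = build_request("MSFT")
print(req.full_url)
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Calling fetch('MSFT') then gives you the status code and the raw HTML, so you can tell a rejected request from a genuinely missing page.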
CodePudding user response:
Using the requests module and passing a User-Agent header, the response status is 200:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
index = 'MSFT'
url_is = 'https://finance.yahoo.com/quote/' + index + '/financials?p=' + index
r = requests.get(url_is, headers=headers)
print(r.status_code)
#page = BeautifulSoup(r.content, 'lxml')
Output:
200