I am trying to scrape some content from different URLs. The first URL looks like (fake examples): "https://something.com/something-1", the second: "https://something.com/somethings-2-mix", and so on; a third might go back to looking like the first one. I am trying to build in exceptions so that my code can handle these different URL patterns. This is the code I have so far, which of course does not work; I think it is not handling the 404 cases as it should.
from selenium import webdriver
from bs4 import BeautifulSoup

url1 = 'https://something.com/something-'
url2 = 'https://something.com/somethings-'
url3 = '-mix'

browser = webdriver.Chrome()
for r in range(1, 2):  # only r = 1 for now; widen the range for more pages
    url = f'{url1}{r}'
    browser.get(url)  # was browser.get(main_url), but main_url was never defined
    soup = BeautifulSoup(browser.page_source, "html.parser")
    if "404" in soup.find("body").text:
        # looks like a 404 page, so retry with the alternative URL pattern
        urlalt = f'{url2}{r}{url3}'
        browser.get(urlalt)
browser.quit()  # quit after the loop, not inside the if, or the session dies
Any ideas/suggestions/answers will be very much appreciated. I apologise in advance if my search here hasn't been exhaustive. Thank you very much!
CodePudding user response:
Unfortunately the Selenium API doesn't natively include a feature to catch 404 errors, and based on this issue it probably never will, since that falls outside its scope. A very solid way to detect a 404 page is to use Python requests: make a simple GET request for each URL and check the status code, like so:
requests.get(url).status_code
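For instance, here is a minimal sketch of how that check could slot into your loop, reusing the (fake) URL patterns from your question; treat it as an illustration rather than a drop-in solution:

import requests
from selenium import webdriver
from bs4 import BeautifulSoup

url1 = 'https://something.com/something-'
url2 = 'https://something.com/somethings-'
url3 = '-mix'

browser = webdriver.Chrome()
for r in range(1, 2):
    url = f'{url1}{r}'
    # cheap pre-check with requests before driving the real browser
    if requests.get(url).status_code == 404:
        url = f'{url2}{r}{url3}'  # fall back to the alternative pattern
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    # ... scrape soup as usual ...
browser.quit()

If the server answers HEAD requests properly, requests.head(url) would avoid downloading each page body twice.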
This of course isn't very efficient performance-wise, since every page is fetched twice, but it should be fine for a few URLs. The other solution would be to scrape the page itself and search for 404 indicators, e.g. the page title, but that needs some effort on your side to make sure it works as expected; a sketch follows below.
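If you'd rather stay entirely in Selenium, the indicator check might look like this. The title test is an assumption on my part: inspect the site's real error page and use whatever marker it reliably shows:

browser.get(url)
# Assumed indicator: many sites put "404" or "not found" in the <title>
# of their error page -- verify this against the actual site first.
if "404" in browser.title or "not found" in browser.title.lower():
    browser.get(f'{url2}{r}{url3}')  # retry with the alternative pattern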