How to alternate between URLs in Python scraping?

I am trying to scrape some content from different URLs. The first URL looks like (fake examples) "https://something.com/something-1", the second like "https://something.com/somethings-2-mix", and the third might go back to looking like the first one. I am trying to build in a fallback so that my code can handle these different URL patterns. I have the following code so far, which of course does not work, as I don't think it is handling the 404 pages as it should.

from selenium import webdriver
from bs4 import BeautifulSoup

url1 = 'https://something.com/something-'
url2 = 'https://something.com/somethings-'
url3 = '-mix'

browser = webdriver.Chrome()

for r in range(1, 2):  # note: range(1, 2) yields only r = 1

    url = f'{url1}{r}'
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    # Fragile 404 check: looks for "404" anywhere in the page body
    if "404" in soup.find("body").text:
        urlalt = f'{url2}{r}{url3}'
        browser.get(urlalt)

Any ideas/suggestions/answers will be very much appreciated. I apologise in advance if my search here hasn't been exhaustive. Thank you very much!

CodePudding user response:

Unfortunately, the Selenium API doesn't natively include a feature to catch 404 errors, and based on this issue it probably won't, since that falls outside its scope. A very solid way to find a 404 page is to use the Python requests library: make a simple GET for each URL and check the status code, like so:

requests.get(url).status_code

This of course isn't very efficient, since each URL ends up being fetched twice (once by requests and once by the browser), but it should be fine for a few URLs. The other solution would be to manually scrape the page and search for 404 indicators, e.g. the page title, but that needs some effort on your side to make sure it works as expected.
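
For illustration, here is a minimal sketch of wiring that status-code check into the question's loop. The URL patterns are the fake placeholders from the question; the range of 1-3, the Chrome driver, and the assumption that the second pattern is live whenever the first one returns a 404 are all illustrative, not part of the original code.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url1 = 'https://something.com/something-'
url2 = 'https://something.com/somethings-'
url3 = '-mix'

browser = webdriver.Chrome()  # assumes a Chrome driver is installed

for r in range(1, 4):  # illustrative range; adjust to your pages
    # Cheap pre-check with requests before loading the page in the browser
    url = f'{url1}{r}'
    if requests.get(url).status_code == 404:
        # First pattern is a 404, so fall back to the second pattern
        url = f'{url2}{r}{url3}'
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    # ... scrape soup here ...

browser.quit()

If fetching every page twice is a concern, requests.head(url) can be used for the pre-check instead of a full GET, since it only retrieves the response headers, though some servers don't answer HEAD requests consistently.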
