Selenium Python: How to check/simulate [ERROR] exit code to continue a FOR loop instead of exiting c


I currently have a Selenium function; here is a summary of what the code does:

def function(list):
    FOR LOOP A in list:  # Page A (initial), contains 12

        requests/bs4 grabs element coordinates
        [f-string transforms them into CSS selectors]  # this is the list being looped through
        selenium driver opens, detects and selects that element

        FOR LOOP B in [f-string...]:  # Page B, contains 1

            driver.current_url is used to prepare the new elements to be detected
            requests/bs4 grabs element coordinates  # this is the list being looped through
            f-string transforms them into a CSS selector

            selenium driver opens, detects and selects that element
            download begins
            sleep for .5 sec
            driver goes back to the previous page

Now, my problem is that at predictable iterations, specifically when loop B is on element 6/12 of the list, it crashes with the following error:

'//OBJECT//' is not clickable at point (591, 797). Other element would receive the click: <div style="position: relative" >...</div>
  (Session info: MicrosoftEdge=...)
Stacktrace:
Backtrace:
...

Now I don't have any problem with it doing that, but I wish it would continue to Page B 7/12 and so on, since it does have the driver.back(). Instead, the application stops.

I tried encasing the entire thing in a try/except with pass to capture this error. However, it then starts again from Page A and still misses the rest.
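
To illustrate why that happens, here is a minimal sketch (visit_and_download is a hypothetical stand-in for the real work):

from selenium.common.exceptions import ElementClickInterceptedException

try:
    for month in months:
        visit_and_download(month)  # hypothetical stand-in for the real work
except ElementClickInterceptedException:
    # Control lands here only AFTER the for loop has been abandoned,
    # so the remaining months are never visited.
    pass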

I would like a way to put a 'continue' statement somewhere, but I've only just started learning and I've run out of ideas. You can see in the raw code that I tried an if/ERROR check at the end of the loop in the hope of putting a pass there, but that seems to be an error. See the raw code below:

import concurrent.futures
import os
import time
import requests
import re

import selenium.common.exceptions
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import multiprocessing

edge_driver = 'C:\\selenium\\webdriver\\edge'
os.environ['PATH'] += edge_driver
web_links = {'digital archive': 'https://digital.nmla.metoffice.gov.uk/SO_1118bfbb-f2c9-476f-aa07-eb58b6db5ce6/', }


def scraping_bot(css_selector):
    # First stage: Years
    print('FIRST STAGE INITIATED....')

    driver = webdriver.Edge()
    driver.get(web_links.get('digital archive'))
    year_args = (By.CSS_SELECTOR, f'a[href="{css_selector}"]')
    driver.find_element(*year_args).click()

    # Second Stage: Months
    print('SECOND STAGE INITIATED....')

    sTWO_url = driver.current_url
    sTWO_site = requests.get(sTWO_url)
    sTWO_web_objects = BeautifulSoup(sTWO_site.text, 'lxml')
    monthly_placeholders = sTWO_web_objects.find(name='div', attrs={'class': 'twelve columns last results'})
    months = monthly_placeholders.find_all(name='h5')

    month_css_selector = {}
    for month_href_tags in months:
        month_tag = f'{month_href_tags.get_text()}'
        month_hrefs = re.findall(regex, str(month_href_tags))
        for month_href in month_hrefs:
            month_css_selector.update({month_tag: month_href})

    for v, y in zip(month_css_selector.values(), month_css_selector.keys()):
        print(v)  ##############################
        month_args = (By.CSS_SELECTOR, f'a[href="{v}/"]')
        driver.find_element(*month_args).click()

        # Third Stage: Download
        print(f'THIRD STAGE INITIATED for: {y}: {v}')

        sTWO_url = driver.current_url

        download_site = requests.get(sTWO_url)
        content = BeautifulSoup(download_site.text, 'lxml')
        nav_controls = content.find_all('nav')
    download_button = [nav.find(attrs={'title': 'download'}) for nav in nav_controls]
        download_regex = r'(?<=href=\").{1,}(?=\" title)'
        for button in download_button:
            if button is not None:
                print(button)  ##############################
                downl = re.findall(download_regex, str(button))
                if len(downl) == 1:
                    for downl_button in downl:
                        download_args = (By.CSS_SELECTOR, f'a[href="{downl_button}"]')
                        driver.find_element(*download_args).click()
                    time.sleep(2)
                    print(f'THIRD STAGE DOWNLOAD COMPLETE: {y}; {v}')

                    ##### END OF TREE HERE ####
                    driver.back()  # goes back to Second Stage and so on
                else:
                    print(f'Your download button has more than 1 match: {len(downl)}')
        # NOTE: this was my attempt, but it doesn't work -- an exception class
        # is always truthy, so this 'if' is no substitute for a try/except.
        if selenium.common.exceptions.ElementClickInterceptedException:
            continue


if __name__ == '__main__':

    sONE_url = requests.get(web_links.get('digital archive'))
    sONE_web_objects = BeautifulSoup(sONE_url.text, 'lxml')

    year_placeholder = sONE_web_objects.find(name='div', attrs={'class': 'sixteen columns results-and-filters'})
    years = year_placeholder.find_all(name='div', attrs={'class': ['one_sixth grey_block new-secondary-background result-item',
                                                                   'one_sixth grey_block new-secondary-background result-item last']})  # don't skip, needed for titles.
    unit = [year.find('h5') for year in years]
    regex = r'(?<=href=\").{1,}(?=\/")'  # lookaround = PositiveLookBehind...PositiveLookAhead

    year_css_selector = []

    titles = [year.get('title') for year in years]
    for year_href_tags, year_tag in zip(unit, titles):  # year_href_tags -> bs4 h5 tag
        hrefs = re.findall(regex, str(year_href_tags))  # search the tag's HTML for the href
        for year_href in hrefs:
            year_css_selector.append(f'{year_href}/')

    for i in year_css_selector:
        scraping_bot(i)

Thus, I wish that my code would simply pass or continue, skipping the erroneous web page, which I can then download manually myself.

UPDATE:

Multiprocessing

I mentioned multiprocessing. Regarding questions about whether I may be creating and destroying drivers: I was hoping that multiprocessing would simply act as another pair of hands.

That is, instead of the for loop at the end, I can spawn a process for each of my CSS selectors (20 in total); since I only have 4 cores, they would run 4 at a time, but that is still significantly faster than 1 crawling bot.
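
A minimal sketch of that idea, assuming scraping_bot stays exactly as defined above (each process creates its own webdriver.Edge(), so every worker owns an independent browser):

import concurrent.futures

# Sketch only: replaces the final "for i in year_css_selector: scraping_bot(i)"
# once year_css_selector has been built as in the raw code above.
if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        executor.map(scraping_bot, year_css_selector)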

SELENIUM Nested FOR LOOP

The website's URL is above the raw code for anyone interested. Simply put, scraping_bot takes care of looping over the last two web pages (the second and final loops), and the for loop under if __name__ == '__main__' handles the initial web page (and that's fine).

It's a for loop because each variable contains one 'twig' and it has to go back to do the remaining 11/12 'branches'. It's nested because the chain of actions normally runs one after the other; the indents are there so that it actually continues on that loop.

I made an error above: the 'outer loop' and 'inner loop' don't both contain 12 items; the inner loop has only 1. But because I'm matching a find_all/regex result for the 'CSS selector' of that 'inner loop', I have to loop through it. Every other way of detecting these elements is buggy in both Selenium and bs4...

Thanks for reading

CodePudding user response:

If I understand your issue, I think you just need to put a try/except in the right place, namely surrounding all the code within the for v, y in zip(month_css_selector.values(), month_css_selector.keys()): block in function scraping_bot. In outline:
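
for v, y in zip(month_css_selector.values(), month_css_selector.keys()):
    try:
        ...  # all of the stage-two and stage-three work for this month
    except selenium.common.exceptions.ElementClickInterceptedException:
        pass  # more or less expected; skip this month and carry on
    except Exception as e:
        print('Got exception:', e)  # anything else: report it and carry on

Here is the complete code: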

import concurrent.futures
import os
import time
import requests
import re

import selenium.common.exceptions
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import multiprocessing

edge_driver = 'C:\\selenium\\webdriver\\edge'
os.environ['PATH'] += edge_driver
web_links = {'digital archive': 'https://digital.nmla.metoffice.gov.uk/SO_1118bfbb-f2c9-476f-aa07-eb58b6db5ce6/', }


def scraping_bot(css_selector):
    # First stage: Years
    print('FIRST STAGE INITIATED....')

    driver = webdriver.Edge()
    driver.get(web_links.get('digital archive'))
    year_args = (By.CSS_SELECTOR, f'a[href="{css_selector}"]')
    driver.find_element(*year_args).click()

    # Second Stage: Months
    print('SECOND STAGE INITIATED....')

    sTWO_url = driver.current_url
    sTWO_site = requests.get(sTWO_url)
    sTWO_web_objects = BeautifulSoup(sTWO_site.text, 'lxml')
    monthly_placeholders = sTWO_web_objects.find(name='div', attrs={'class': 'twelve columns last results'})
    months = monthly_placeholders.find_all(name='h5')

    month_css_selector = {}
    for month_href_tags in months:
        month_tag = f'{month_href_tags.get_text()}'
        month_hrefs = re.findall(regex, str(month_href_tags))
        for month_href in month_hrefs:
            month_css_selector.update({month_tag: month_href})

    for v, y in zip(month_css_selector.values(), month_css_selector.keys()):
        try:
            print(v)  ##############################
            month_args = (By.CSS_SELECTOR, f'a[href="{v}/"]')
            driver.find_element(*month_args).click()
    
            # Third Stage: Download
            print(f'THIRD STAGE INITIATED for: {y}: {v}')
    
            sTWO_url = driver.current_url
    
            download_site = requests.get(sTWO_url)
            content = BeautifulSoup(download_site.text, 'lxml')
            nav_controls = content.find_all('nav')
            download_button = [nav.find(attrs={'title': 'download'}) for nav in nav_controls]
            download_regex = r'(?<=href=\").{1,}(?=\" title)'
            for button in download_button:
                if button is not None:
                    print(button)  ##############################
                    downl = re.findall(download_regex, str(button))
                    if len(downl) == 1:
                        for downl_button in downl:
                            download_args = (By.CSS_SELECTOR, f'a[href="{downl_button}"]')
                            driver.find_element(*download_args).click()
                        time.sleep(2)
                        print(f'THIRD STAGE DOWNLOAD COMPLETE: {y}; {v}')
    
                        ##### END OF TREE HERE ####
                        driver.back()  # goes back to Second Stage and so on
                    else:
                        print(f'Your download button has more than 1 match: {len(downl)}')
        except selenium.common.exceptions.ElementClickInterceptedException:
            # This is sort of expected:
            pass
        except Exception as e:
            # If it is something else, print it out:
            print('Got exception:', e)


if __name__ == '__main__':

    sONE_url = requests.get(web_links.get('digital archive'))
    sONE_web_objects = BeautifulSoup(sONE_url.text, 'lxml')

    year_placeholder = sONE_web_objects.find(name='div', attrs={'class': 'sixteen columns results-and-filters'})
    years = year_placeholder.find_all(name='div', attrs={'class': ['one_sixth grey_block new-secondary-background result-item',
                                                                   'one_sixth grey_block new-secondary-background result-item last']})  # don't skip, needed for titles.
    unit = [year.find('h5') for year in years]
    regex = r'(?<=href=\").{1,}(?=\/")'  # lookaround = PositiveLookBehind...PositiveLookAhead

    year_css_selector = []

    titles = [year.get('title') for year in years]
    for year_href_tags, year_tag in zip(unit, titles):  # year_href_tags -> bs4 h5 tag
        hrefs = re.findall(regex, str(year_href_tags))  # search the tag's HTML for the href
        for year_href in hrefs:
            year_css_selector.append(f'{year_href}/')

    for i in year_css_selector:
        scraping_bot(i)

CodePudding user response:

In light of how useful this post might be for those facing the same problem: @Booboo's answer is helpful in most cases. However, regarding the problem of Selenium drivers and for loops, I found that correct indentation (as mentioned) and correct arguments were the culprits.

Specifically, my regex did not catch URLs in two scenarios:

  1. where it looks like this:

    link rel="alternate" type="application/rss+xml" title="Met Office UA » DWS_2003_06 Comments Feed" href="https://digital.nmla.metoffice.gov.uk/IO_e273bcd1-7131-482d-aec0-04755809ec3a/feed/"

  2. where there are additional attributes:

    a href="https://digital.nmla.metoffice.gov.uk/download/file/IO_efa3ef81-4812-4c8e-a4ab-055b147644d2" title="download"

I found that simply changing the download-button regex to include an 'OR' (alternation) fixed this situation:

(?<=href=").{1,}(?=" title|/">)

instead of

(?<=href=").{1,}(?=" title)

...obviously along with the answer posted above.
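
As a quick check, here is a small sketch applying both patterns to the two cases (the tags are reconstructed from the fragments quoted above, so treat them as illustrative):

import re

old_pattern = r'(?<=href=").{1,}(?=" title)'
new_pattern = r'(?<=href=").{1,}(?=" title|/">)'

# Reconstructed from the fragments quoted above.
feed_link = ('<link rel="alternate" type="application/rss+xml" '
             'title="Met Office UA » DWS_2003_06 Comments Feed" '
             'href="https://digital.nmla.metoffice.gov.uk/'
             'IO_e273bcd1-7131-482d-aec0-04755809ec3a/feed/">')
download_link = ('<a href="https://digital.nmla.metoffice.gov.uk/download/file/'
                 'IO_efa3ef81-4812-4c8e-a4ab-055b147644d2" title="download">')

print(re.findall(old_pattern, feed_link))      # [] -- the old pattern misses the feed link
print(re.findall(new_pattern, feed_link))      # ['https://...ec3a/feed']
print(re.findall(new_pattern, download_link))  # ['https://.../IO_efa3ef81-...-055b147644d2']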
