Connection problems for scraping: only 1 of 2 scrapes starts (the other is ignored)

Time: 11-03

I am writing a small scraping script, for study and personal purposes only (non-profit). My problem does not seem to be with the scraping itself but with the connection (I think, since I can reach the site without problems and I am not getting any bad-request errors). I have noticed that the scraping sometimes works correctly and sometimes it doesn't: only 1 of the 2 scrapes starts. It no longer fails just "halfway" (50% yes, 50% no); now Serie B is scraped correctly only 1 time out of 5-6-7 attempts.

CODE EXPLANATION: The code connects to Tor as a proxy via Firefox, then starts 2 different scrapes with 2 "for" loops (Serie A and Serie B). The aim is simply to scrape the team names in the two loops.

PROBLEM: I am not getting any errors, but the Serie B scrape seems to be ignored: only Serie A is scraped, not Serie B (even though they use the same scraping code). Days ago both scrapes worked correctly, and only occasionally did Serie B fail. Now, however, Serie B is scraped correctly only 1 time out of 5-6-7 attempts.

Intuitively, I would say the problem is the Tor connection. I also tried copying and pasting the Tor-connection code ... into the Serie B for loop, so that Serie A and Serie B each had their own Tor connection. Initially it worked correctly and both Serie A and Serie B were scraped; in subsequent attempts, Serie B was not.

What's the problem? A Python code problem? A problem with the Tor connection through the Firefox proxy? Something else? What should I change, and how can I solve it? If the code I wrote is incorrect, what should it look like instead? Thanks.

    ######## TOR CONNECTION WITH FIREFOX ########
    from selenium import webdriver
    from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
    from selenium.webdriver.common.by import By
    import os
    
    tor_linux = os.popen('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US') 
    
    profile = FirefoxProfile('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default')
    profile.set_preference('network.proxy.type', 1)
    profile.set_preference('network.proxy.socks', '127.0.0.1')
    profile.set_preference('network.proxy.socks_port', 9050)
    profile.set_preference("network.proxy.socks_remote_dns", False) 
    
    profile.update_preferences()
    
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.binary_location = '/usr/bin/firefox' 
    
    driver = webdriver.Firefox(
        firefox_profile=profile, options=firefox_options, 
        executable_path='/usr/bin/geckodriver')
    ########################################################################    
    
    #I need this for subsequent insertion into the database
    Values_SerieA = []
    Values_SerieB = []
    
    
    #### SCRAPING SERIE A ####
    driver.minimize_window()
    driver.get("https://www.diretta.it/serie-a/classifiche/")
    for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieA_text = SerieA.text
        Values_SerieA.append(tuple([SerieA_text])) #add the teams to the empty Values list
        print(SerieA_text)
    driver.close
    
    #### SCRAPING SERIE B ######
    driver.minimize_window()
    driver.get("https://www.diretta.it/serie-b/classifiche/")
    for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieB_text = SerieA.text
        Values_SerieB.append(tuple([SerieB_text])) #add the teams to the empty Values list
        print(SerieB_text)
    driver.close
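Since the suspicion above is the Tor connection, one quick check before touching the scraping code is whether anything is even listening on the SOCKS port the profile points at (127.0.0.1:9050). A minimal stdlib-only sketch; the function name `tor_socks_reachable` is made up for illustration:

```python
import socket

def tor_socks_reachable(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to the SOCKS port succeeds.

    This only proves something is listening (usually Tor); it does not
    prove that Tor has finished bootstrapping its circuits.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: bail out early instead of scraping through a dead proxy.
# if not tor_socks_reachable():
#     raise RuntimeError("Tor SOCKS proxy is not reachable on 127.0.0.1:9050")
```

If this returns False intermittently, the flaky behaviour is on the Tor side rather than in the scraping loops.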

CodePudding user response:

A couple of things worth mentioning:

  • Selenium is synchronous, so setting driver.implicitly_wait(2) after requesting a site makes the driver keep retrying element lookups for up to 2 seconds, giving the page time to load before find_elements gives up on elements that haven't been added to the DOM yet.

  • You are trying to minimize the driver window even though the last step you performed was to close the driver window. Try flipping the first two lines of the Serie B part, then put a time.sleep(2) or driver.implicitly_wait(2) immediately after. (Also note that driver.close without parentheses is a no-op: it references the method without calling it.)

  • I've not used a proxy with the driver, so I cannot tell you whether that would be creating connection issues. If you're able to get to the site without some sort of bad-request error, I would assume the connection isn't the problem.
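To make the "wait for the DOM" idea explicit rather than relying on the global implicit wait, you can also poll find_elements yourself until it returns something. A minimal sketch of that idea in plain Python, similar in spirit to Selenium's WebDriverWait (the helper name `wait_for_elements` is made up; in Selenium 4 the string "css selector" is the locator-strategy value behind By.CSS_SELECTOR):

```python
import time

def wait_for_elements(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll driver.find_elements until it returns a non-empty list.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    "css selector" is the locator-strategy string behind By.CSS_SELECTOR.
    """
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements("css selector", css_selector)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no elements matched {css_selector!r} in {timeout}s")
        time.sleep(poll)
```

With this, a page that is slow through Tor fails loudly with a timeout instead of silently yielding an empty list, which would make the "Serie B is sometimes skipped" symptom much easier to diagnose.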

=== try this out ===

    #### SCRAPING SERIE A ####

    # request site
    driver.get("https://www.diretta.it/serie-a/classifiche/")

    # let element lookups wait up to 2s for the page to load
    driver.implicitly_wait(2)

    # once you're sure the page is loaded, minimize the window
    driver.minimize_window()

    for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieA_text = SerieA.text
        Values_SerieA.append(tuple([SerieA_text])) #add the teams to the empty Values list
        print(SerieA_text)
    # do not close the driver here: it is still needed for Serie B

    #### SCRAPING SERIE B ######

    # request the site
    driver.get("https://www.diretta.it/serie-b/classifiche/")

    # wait for everything to load
    driver.implicitly_wait(2)

    # once you're sure the window is loading correctly you can move
    # this back up to happen before the wait
    driver.minimize_window()

    for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieB_text = SerieB.text  # was SerieA.text (copy-paste bug)
        Values_SerieB.append(tuple([SerieB_text])) #add the teams to the empty Values list
        print(SerieB_text)
    driver.quit()
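Since the Serie A and Serie B blocks are identical except for the URL, one way to avoid this class of copy-paste bug (the original had SerieB_text = SerieA.text) is to drive both pages through one shared loop and only quit the driver at the end. A sketch under the assumption that the same selector works on both pages; `scrape_leagues` and `LEAGUE_URLS` are names made up here:

```python
TEAM_SELECTOR = "a[href^='/squadra'][class^='tableCellParticipant__name']"

LEAGUE_URLS = {
    "SerieA": "https://www.diretta.it/serie-a/classifiche/",
    "SerieB": "https://www.diretta.it/serie-b/classifiche/",
}

def scrape_leagues(driver, urls=LEAGUE_URLS, selector=TEAM_SELECTOR):
    """Scrape team names from each league page with one shared code path.

    Returns {"SerieA": [("Milan",), ...], "SerieB": [...]} as one-element
    tuples, ready for executemany() inserts into the database.
    ("css selector" is the locator string behind By.CSS_SELECTOR.)
    """
    results = {}
    for league, url in urls.items():
        driver.get(url)
        results[league] = [
            (element.text,)
            for element in driver.find_elements("css selector", selector)
        ]
    return results

# Usage (driver built exactly as in the question):
# values = scrape_leagues(driver)
# Values_SerieA, Values_SerieB = values["SerieA"], values["SerieB"]
# driver.quit()
```

With a single code path, either both leagues scrape or neither does, which makes a flaky Tor connection much easier to tell apart from a code bug.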