Connection problems for scraping: only 1 of 2 scrapes starts (the other is ignored)

Time: 11-03

I am writing a small scraping script, for study and personal purposes only (non-profit). My problem does not seem to be with the scraping itself but with the connection (I think, since I can reach the site without problems and I am not getting any bad-request errors). I have noticed that the scraping sometimes works correctly and sometimes it doesn't: only 1 of the 2 scrapes starts. It no longer fails just "halfway" (50% yes, 50% no); now Serie B is scraped correctly only 1 time out of 5-6-7 attempts.

CODE EXPLANATION: The code connects to Tor as a proxy via Firefox, then starts 2 different scrapes with 2 "for" loops (Serie A and Serie B). The aim is simply to scrape the team names in the two loops.

PROBLEM: I am not getting any errors, but the Serie B scrape seems to be ignored: only Serie A is scraped, not Serie B (even though they use the same scraping code). Days ago both scrapes worked correctly, and only occasionally did Serie B fail. Now, however, Serie B is scraped correctly only 1 time out of 5-6-7 attempts.

Intuitively, I would say the problem is the Tor connection. I also tried copying and pasting the Tor-connection code ... into the Serie B for loop, so that Serie A and Serie B each had their own Tor connection. Initially it worked correctly and both Serie A and Serie B were scraped; in subsequent attempts, Serie B was not.

What's the problem? A Python code problem? A problem with the Tor connection through the Firefox proxy? Something else? What should I change, and how can I solve it? If the code I wrote is incorrect, what should it look like instead? Thanks.

    ######## TOR CONNECTION WITH FIREFOX ########
    from selenium import webdriver
    from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
    from selenium.webdriver.common.by import By
    import os
    
    tor_linux = os.popen('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US') 
    
    profile = FirefoxProfile('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default')
    profile.set_preference('network.proxy.type', 1)
    profile.set_preference('network.proxy.socks', '127.0.0.1')
    profile.set_preference('network.proxy.socks_port', 9050)
    profile.set_preference("network.proxy.socks_remote_dns", False) 
    
    profile.update_preferences()
    
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.binary_location = '/usr/bin/firefox' 
    
    driver = webdriver.Firefox(
        firefox_profile=profile, options=firefox_options, 
        executable_path='/usr/bin/geckodriver')
    ########################################################################    
    
    #I need this for subsequent insertion into the database
    Values_SerieA = []
    Values_SerieB = []
    
    
    #### SCRAPING SERIE A ####
    driver.minimize_window()
    driver.get("https://www.diretta.it/serie-a/classifiche/")
    for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieA_text = SerieA.text
        Values_SerieA.append(tuple([SerieA_text])) #add the teams to the empty Values list
        print(SerieA_text)
    driver.close
    
    #### SCRAPING SERIE B ######
    driver.minimize_window()
    driver.get("https://www.diretta.it/serie-b/classifiche/")
    for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieB_text = SerieA.text
        Values_SerieB.append(tuple([SerieB_text])) #add the teams to the empty Values list
        print(SerieB_text)
    driver.close
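Since the suspicion above is the Tor connection, one quick check before touching the scraping code is whether anything is even listening on the SOCKS port the profile points at (127.0.0.1:9050). A minimal stdlib-only sketch; the function name `tor_socks_reachable` is made up for illustration:

```python
import socket

def tor_socks_reachable(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to the SOCKS port succeeds.

    This only proves something is listening (usually Tor); it does not
    prove that Tor has finished bootstrapping its circuits.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: bail out early instead of scraping through a dead proxy.
# if not tor_socks_reachable():
#     raise RuntimeError("Tor SOCKS proxy is not reachable on 127.0.0.1:9050")
```

If this returns False intermittently, the flaky behaviour is on the Tor side rather than in the scraping loops.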

CodePudding user response:

A couple of things worth mentioning:

  • Selenium is synchronous, so setting driver.implicitly_wait(2) after requesting a site makes the driver keep retrying element lookups for up to 2 seconds, giving the page time to load before find_elements gives up on elements that haven't been added to the DOM yet.

  • You are trying to minimize the driver window even though the last step you performed was to close the driver window. Try flipping the first two lines of the Serie B part, then put a time.sleep(2) or driver.implicitly_wait(2) immediately after. (Also note that driver.close without parentheses is a no-op: it references the method without calling it.)

  • I've not used a proxy with the driver, so I cannot tell you whether that would be creating connection issues. If you're able to get to the site without some sort of bad-request error, I would assume the connection isn't the problem.
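To make the "wait for the DOM" idea explicit rather than relying on the global implicit wait, you can also poll find_elements yourself until it returns something. A minimal sketch of that idea in plain Python, similar in spirit to Selenium's WebDriverWait (the helper name `wait_for_elements` is made up; in Selenium 4 the string "css selector" is the locator-strategy value behind By.CSS_SELECTOR):

```python
import time

def wait_for_elements(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll driver.find_elements until it returns a non-empty list.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    "css selector" is the locator-strategy string behind By.CSS_SELECTOR.
    """
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements("css selector", css_selector)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no elements matched {css_selector!r} in {timeout}s")
        time.sleep(poll)
```

With this, a page that is slow through Tor fails loudly with a timeout instead of silently yielding an empty list, which would make the "Serie B is sometimes skipped" symptom much easier to diagnose.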

=== try this out ===

    #### SCRAPING SERIE A ####

    # request site
    driver.get("https://www.diretta.it/serie-a/classifiche/")

    # let element lookups wait up to 2s for the page to load
    driver.implicitly_wait(2)

    # once you're sure the page is loaded, minimize the window
    driver.minimize_window()

    for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieA_text = SerieA.text
        Values_SerieA.append(tuple([SerieA_text])) #add the teams to the empty Values list
        print(SerieA_text)
    # do not close the driver here: it is still needed for Serie B

    #### SCRAPING SERIE B ######

    # request the site
    driver.get("https://www.diretta.it/serie-b/classifiche/")

    # wait for everything to load
    driver.implicitly_wait(2)

    # once you're sure the window is loading correctly you can move
    # this back up to happen before the wait
    driver.minimize_window()

    for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieB_text = SerieB.text  # was SerieA.text (copy-paste bug)
        Values_SerieB.append(tuple([SerieB_text])) #add the teams to the empty Values list
        print(SerieB_text)
    driver.quit()
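Since the Serie A and Serie B blocks are identical except for the URL, one way to avoid this class of copy-paste bug (the original had SerieB_text = SerieA.text) is to drive both pages through one shared loop and only quit the driver at the end. A sketch under the assumption that the same selector works on both pages; `scrape_leagues` and `LEAGUE_URLS` are names made up here:

```python
TEAM_SELECTOR = "a[href^='/squadra'][class^='tableCellParticipant__name']"

LEAGUE_URLS = {
    "SerieA": "https://www.diretta.it/serie-a/classifiche/",
    "SerieB": "https://www.diretta.it/serie-b/classifiche/",
}

def scrape_leagues(driver, urls=LEAGUE_URLS, selector=TEAM_SELECTOR):
    """Scrape team names from each league page with one shared code path.

    Returns {"SerieA": [("Milan",), ...], "SerieB": [...]} as one-element
    tuples, ready for executemany() inserts into the database.
    ("css selector" is the locator string behind By.CSS_SELECTOR.)
    """
    results = {}
    for league, url in urls.items():
        driver.get(url)
        results[league] = [
            (element.text,)
            for element in driver.find_elements("css selector", selector)
        ]
    return results

# Usage (driver built exactly as in the question):
# values = scrape_leagues(driver)
# Values_SerieA, Values_SerieB = values["SerieA"], values["SerieB"]
# driver.quit()
```

With a single code path, either both leagues scrape or neither does, which makes a flaky Tor connection much easier to tell apart from a code bug.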