I am writing a small scraping script, for study and personal purposes only (non-profit). My problem is not about the scraping itself but, I think, about the connection, although I can reach the site without any bad-request errors. I have noticed that the scraping sometimes works and sometimes doesn't: only one of the two scrapes runs. It used to fail about half the time (50% yes, 50% no), but now Serie B is scraped correctly only about 1 time in 5-7 attempts.
CODE EXPLANATION: The code connects to Tor as a proxy via Firefox, then starts two separate scrapes with two "for" loops (Serie A and Serie B). The aim is simply to scrape the team names in the two loops.
PROBLEM: I get no errors, but the Serie B scrape seems to be ignored: only Serie A is scraped, never Serie B, even though they use the same scraping code. A few days ago both scrapes worked correctly, with Serie B only occasionally failing. Now Serie B is scraped correctly only about 1 time in 5-7 attempts.
Intuitively, I would say the problem is the Tor connection. I also tried copying and pasting the Tor-connection code before the Serie B for loop as well, so that Serie A and Serie B each had their own Tor connection. At first it worked and both Serie A and Serie B were scraped, but on subsequent attempts Serie B again failed to scrape.
What is the problem? The Python code? The Tor connection through the Firefox proxy? Something else? What should I change, and how can I fix it? If the code I wrote is incorrect, what should I write instead? Thanks.
######## TOR CONNECTION WITH FIREFOX ########
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.common.by import By  # needed for the find_elements calls below
import os
tor_linux = os.popen('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US')
profile = FirefoxProfile('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default')
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
profile.set_preference("network.proxy.socks_remote_dns", False)
profile.update_preferences()
firefox_options = webdriver.FirefoxOptions()
firefox_options.binary_location = '/usr/bin/firefox'
driver = webdriver.Firefox(
firefox_profile=profile, options=firefox_options,
executable_path='/usr/bin/geckodriver')
########################################################################
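(A quick sanity check before creating the driver, using only the standard library: confirm something is actually listening on the SOCKS port configured above, 127.0.0.1:9050. This is a sketch; the function name is made up. If it returns False, Firefox is either not going through Tor or the connection is failing silently.)

```python
import socket

def tor_socks_ready(host="127.0.0.1", port=9050, timeout=3.0):
    """Return True if a TCP listener (e.g. Tor's SOCKS proxy) accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, host unreachable, ...
        return False
```

For example, `if not tor_socks_ready(): raise RuntimeError("Tor SOCKS proxy not reachable")` right before `webdriver.Firefox(...)` would turn a silent proxy failure into a clear error.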
#I need this for subsequent insertion into the database
Values_SerieA = []
Values_SerieB = []
#### SCRAPING SERIE A ####
driver.minimize_window()
driver.get("https://www.diretta.it/serie-a/classifiche/")
for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
SerieA_text = SerieA.text
Values_SerieA.append(tuple([SerieA_text])) # add the teams to the empty Values list
print(SerieA_text)
driver.close
#### SCRAPING SERIE B ######
driver.minimize_window()
driver.get("https://www.diretta.it/serie-b/classifiche/")
for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
SerieB_text = SerieA.text
Values_SerieB.append(tuple([SerieB_text])) # add the teams to the empty Values list
print(SerieB_text)
driver.close
CodePudding user response:
A couple of things worth mentioning:

- Selenium is synchronous, so calling driver.implicitly_wait(2) after requesting a site gives the page time to load before your driver starts looking for an element that hasn't been added to the DOM yet.
- You are trying to minimize the driver window even though the last step you performed was to close the driver window. Try flipping the first two lines of the Serie B part, then put a time.sleep(2) or driver.implicitly_wait(2) immediately after.
- I've not used a proxy with the driver, so I can't tell you whether that would be creating connection issues. If you're able to reach the site without getting some sort of bad-request error, I would assume the connection isn't the problem.
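Since the failure is intermittent rather than an error, another pragmatic option is to retry the scrape a few times until it returns something. This is a generic sketch, not Selenium-specific; the function and parameter names are made up:

```python
import time

def retry_until_nonempty(scrape_fn, attempts=5, delay=2.0):
    """Call scrape_fn() up to `attempts` times, sleeping `delay` seconds
    between tries; return the first non-empty result, or [] if all tries fail."""
    for attempt in range(attempts):
        result = scrape_fn()
        if result:  # a non-empty list means the table had loaded
            return result
        if attempt < attempts - 1:
            time.sleep(delay)
    return []
```

You could then wrap the Serie B scrape as, e.g., `retry_until_nonempty(lambda: [el.text for el in driver.find_elements(...)])` so one slow page load doesn't make the whole run come back empty.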
=== try this out ===
#### SCRAPING SERIE A ####
# request site
driver.get("https://www.diretta.it/serie-a/classifiche/")
# wait for it to load
driver.implicitly_wait(2)
# once you're sure page is loaded, minimize window
driver.minimize_window()
for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
SerieA_text = SerieA.text
Values_SerieA.append(tuple([SerieA_text])) # add the teams to the empty Values list
print(SerieA_text)
# don't close the window here: the Serie B scrape below still needs it
#### SCRAPING SERIE B ######
# request the site
driver.get("https://www.diretta.it/serie-b/classifiche/")
# wait for everything to load
driver.implicitly_wait(2)
# once you're sure the window is loading correctly you can move
# this back up to happen before the wait
driver.minimize_window()
for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
SerieB_text = SerieB.text  # use SerieB here, not SerieA (copy-paste slip in the original)
Values_SerieB.append(tuple([SerieB_text])) # add the teams to the empty Values list
print(SerieB_text)
driver.quit()  # note the parentheses: bare driver.close does nothing; quit() ends the session
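Since the Serie A and Serie B blocks are identical apart from the URL, factoring them into one helper also removes the copy-paste slip (`SerieA.text` inside the Serie B loop). A sketch, assuming the same selector works on both pages; the string `"css selector"` is the literal value behind Selenium's `By.CSS_SELECTOR`:

```python
def scrape_team_names(driver, url, selector, wait_seconds=2):
    """Load `url`, give matching elements time to appear, and return their text."""
    driver.get(url)
    driver.implicitly_wait(wait_seconds)  # applies to the find_elements call below
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    return [el.text for el in driver.find_elements("css selector", selector)]
```

Hypothetical usage with the selector from the question:

```python
selector = "a[href^='/squadra'][class^='tableCellParticipant__name']"
Values_SerieA = [(n,) for n in scrape_team_names(driver, "https://www.diretta.it/serie-a/classifiche/", selector)]
Values_SerieB = [(n,) for n in scrape_team_names(driver, "https://www.diretta.it/serie-b/classifiche/", selector)]
driver.quit()
```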