I want to write a Python script using Selenium to scrape a website. Following the Real Python article on it, I copied and pasted the following code into a .py file:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
opts = Options()
opts.set_headless()
assert opts.headless # Operating in headless mode
browser = Firefox(options=opts)
browser.get('https://duckduckgo.com')
Running the script I get the following error:
opts.set_headless()
AttributeError: 'Options' object has no attribute 'set_headless'
Following this article, I commented out the opts.set_headless() call and added opts.headless = True, but now I get the following error:
Traceback (most recent call last):
File "/home/usr/local/folder/scraper.py", line 10, in <module>
browser = Firefox(options=opts)
File "/home/usr/local/folder/scraper/venv/lib/python3.10/site-packages/selenium/webdriver/firefox/webdriver.py", line 192, in __init__
self.service.start()
File "/home/usr/local/folder/scraper/venv/lib/python3.10/site-packages/selenium/webdriver/common/service.py", line 106, in start
self.assert_process_still_running()
File "/home/usr/local/folder/scraper/venv/lib/python3.10/site-packages/selenium/webdriver/common/service.py", line 119, in assert_process_still_running
raise WebDriverException(f"Service {self.path} unexpectedly exited. Status code was: {return_code}")
selenium.common.exceptions.WebDriverException: Message: Service geckodriver unexpectedly exited. Status code was: -6
I verified that geckodriver is on my $PATH, so I have no idea why this isn't working. I am using Selenium v4.7.2.
CodePudding user response:
This should work:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
opts = Options()
opts.headless = True
# executable_path is deprecated in Selenium 4; a Service object is the replacement
browser = webdriver.Firefox(options=opts, executable_path=r'C:\TheActualPathToThe\geckodriver.exe')
browser.get('https://duckduckgo.com')
CodePudding user response:
After much hair pulling, I was able to determine that almost all articles on the internet dealing with Selenium use deprecated methods and attributes. Hopefully this answer will help many others who have been trying to use this library.
First, the .set_headless() method has been removed and no longer works. The Python Forums had a helpful discussion around it. To run a headless browser, the approach that worked for me was .add_argument("--headless").
Second, there is now a Service class that needs to be imported and used for the executable_path= pointing to geckodriver and for any other paths, such as logs. These two posts helped on this matter: stackoverflow_1 and stackoverflow_2.
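Before handing a path to Service, it can help to confirm that the driver binary is actually resolvable. Here is a minimal sketch using only the standard library. The helper name find_geckodriver and its search_path parameter are my own and not part of Selenium:

```python
import shutil


def find_geckodriver(search_path=None):
    """Return the path to geckodriver as resolved by shutil.which.

    search_path is an optional os.pathsep-separated string of directories;
    when it is None, shutil.which falls back to the PATH environment variable.
    """
    location = shutil.which("geckodriver", path=search_path)
    if location is None:
        raise FileNotFoundError("geckodriver was not found on the search path")
    return location
```

The returned string could then be passed along as Service(executable_path=find_geckodriver()), which fails early with a readable message instead of a driver crash.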
Third, after fixing the code and using the correct modules, attributes, methods and arguments, the script was still getting hung up. The logs pointed to a socket timeout and to an issue that the dev team was dealing with in Sep 2022. This helped me realize that the geckodriver version linked in the original Real Python article I was using was long outdated and needed to be updated to the latest version, which is v0.32.0 at the time of writing.
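A quick way to see which geckodriver version is installed is to ask the binary itself. This is only a sketch: the function names are hypothetical, and it assumes the output of geckodriver --version starts with a line like "geckodriver 0.32.0 ...":

```python
import re
import subprocess


def parse_geckodriver_version(version_output):
    """Extract (major, minor, patch) from `geckodriver --version` output."""
    # Assumed format: the text "geckodriver" followed by a dotted version.
    match = re.search(r"geckodriver\s+(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a version number in the output")
    return tuple(int(part) for part in match.groups())


def installed_geckodriver_version(binary="geckodriver"):
    """Run the binary with --version and return its parsed version tuple."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    return parse_geckodriver_version(result.stdout)
```

Because the result is a tuple of integers, a check such as installed_geckodriver_version() >= (0, 32, 0) compares versions numerically rather than as strings.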
However, that wasn't why it was getting hung up. I commented out the headless argument, and that showed that the Firefox browser itself was the issue. Apparently, on Ubuntu 22.04, Firefox is installed by default as a snap and needs to be reinstalled from a .deb package instead. Here is a good article explaining it.
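To check whether a machine has the snap build of Firefox, a rough heuristic is to look for the command that snap exposes. This sketch assumes snap places application commands under /snap/bin, which is the default on Ubuntu. The function name is my own, and the check only says that a snap Firefox exists, not which Firefox geckodriver will end up launching:

```python
import os


def firefox_snap_present(snap_bin_dir="/snap/bin"):
    """Heuristic: report whether a snap-provided `firefox` command exists.

    lexists is used so that a symlink entry counts even if its target
    cannot be resolved.
    """
    return os.path.lexists(os.path.join(snap_bin_dir, "firefox"))
```

If this returns True on a machine where geckodriver exits unexpectedly, the snap packaging is a reasonable first suspect.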
So ultimately, there were many different issues. The library is constantly being updated, and the older features that most articles on the internet rely on are now deprecated. The Selenium documentation isn't the greatest either. Here is my final code, with the previous issues commented out:
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
# Setup--
options = Options()
options.add_argument("--headless")
service = Service(executable_path="/home/$PATH/location/geckodriver.exe", log_path="/home/file/location/log/geckodriver.log")
browser = Firefox(service=service, options=options)
### Deprecated
# from selenium import webdriver
# caps = webdriver.DesiredCapabilities().FIREFOX
# caps["marionette"] = True
# browser = webdriver.Firefox(firefox_profile=options, capabilities=caps, executable_path="~/bin/geckodriver.exe")
# Parse--
browser.get('https://duckduckgo.com')
# find_element takes a By locator and returns a single WebElement, so it is not indexed
logo = browser.find_element(by=By.ID, value='logo_homepage_link')
print(logo.text)
browser.quit()