Webscraping with Selenium in Python-CodePudding

I am trying to webscrape the list of DAOs from masari.io but I am having trouble because I get the following errors:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object


driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

DevTools listening on ws://127.0.0.1:56691/devtools/browser/b4609671-5e6e-4d25-b09e-4116b3dde4bf
[0525/100030.252:INFO:CONSOLE(1)] "enabling sentry error tracker", source: https://messari.io/static/js/main.977a4794.chunk.js (1)
[0525/100030.951:INFO:CONSOLE(2)] "Unable to refresh token: Login required", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.065:INFO:CONSOLE(2)] "


88b           d88                                                            88
888b         d888                                                            ""
88'8b       d8'88
88 '8b     d8' 88   ,adPPYba,  ,adPPYba,  ,adPPYba,  ,adPPYYba,  8b,dPPYba,  88
88  '8b   d8'  88  a8P_____88  I8[    ""  I8[    ""  ""     'Y8  88P'   "Y8  88
88   '8b d8'   88  8PP"""""""   '"Y8ba,    '"Y8ba,   ,adPPPPP88  88          88
88    '888'    88  "8b,   ,aa  aa    ]8I  aa    ]8I  88,    ,88  88          88
88     '8'     88   '"Ybbd8"'  '"YbbdP"'  '"YbbdP"'  '"8bbdP"Y8  88          88


", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.069:INFO:CONSOLE(2)] "Interested in a CHALLENGE? Check out: https://messari.io/quiz", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
Traceback (most recent call last):
  File "c:/Users/Student/webScrape/scraper.py", line 21, in <module>
    matches = WebDriverWait(driver, 10).until(
  File "C:\Users\Student\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\support\wait.py", line 89, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
Backtrace:
        Ordinal0 [0x0096B8F3 2406643]
        Ordinal0 [0x008FAF31 1945393]
        Ordinal0 [0x007EC748 837448]
        Ordinal0 [0x008192E0 1020640]
        Ordinal0 [0x0081957B 1021307]
        Ordinal0 [0x00846372 1205106]
        Ordinal0 [0x008342C4 1131204]
        Ordinal0 [0x00844682 1197698]
        Ordinal0 [0x00834096 1130646]
        Ordinal0 [0x0080E636 976438]
        Ordinal0 [0x0080F546 980294]
        GetHandleVerifier [0x00BD9612 2498066]
        GetHandleVerifier [0x00BCC920 2445600]
        GetHandleVerifier [0x00A04F2A 579370]
        GetHandleVerifier [0x00A03D36 574774]
        Ordinal0 [0x00901C0B 1973259]
        Ordinal0 [0x00906688 1992328]
        Ordinal0 [0x00906775 1992565]
        Ordinal0 [0x0090F8D1 2029777]
        BaseThreadInitThunk [0x777BFA29 25]
        RtlGetAppContainerNamedObjectPath [0x77B77A7E 286]
        RtlGetAppContainerNamedObjectPath [0x77B77A4E 238]

I know there is an API for messari.io, but I am almost certain it is only for their assets and not their list of DAOs. I tried using Selenium since it is a dynamic page but I am still having trouble. Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests

url = 'https://messari.io/governor/daos'

DRIVER_PATH = 'PATH_TO_DRIVER_ON_MY_PC'
options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

# s = Service('PATH_TO_DRIVER_ON_MY_PC')
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "td")))
    # for match in matches:
    #     print(match.text)

finally:
    driver.quit()

Update I fixed the executable_path warning, but I am still getting the same TimeoutException error. And when I run it without headless I also get the following message:

DevTools listening on ws://127.0.0.1:57773/devtools/browser/4450b78d-3a9f-401a-b39c-2c716ecad924
[9628:20616:0525/102300.840:ERROR:device_event_log_impl.cc(214)] [10:23:00.840] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9628:20616:0525/102300.841:ERROR:device_event_log_impl.cc(214)] [10:23:00.841] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)

I assume this part is more of a hardware message that I shouldn't worry about based on similar questions bc when I unplugged my mouse it removed one of them.

CodePudding user response：

This page doesn't use <td> to display list of DAOs.
It uses <div> (with CSS) to display it similar to table.

And it keeps name of DAO in <h4>

At least it uses and in my Firefox on laptop with Linux.

Full working code (tested on Linux Mint, Python 3.8, Selenium 4.x, Chrome 101.x)

I used module webdriver_manager so it automatically downloads fresh driver when Linux installs newer version of Chrome

I have to use find_elements() (with s in word elements) or presence_of_all_elements_located() to get all <h4>.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from webdriver_manager.chrome import ChromeDriverManager

url = 'https://messari.io/governor/daos'

options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "h4")))
    
    #matches = driver.find_elements(By.TAG_NAME, "h4")
    
    for match in matches:
        if match.text:
            print(match.text)
finally:
    driver.quit()

Result:

Fei
Rook
Cosmos
Stargate Finance
Aave
Treasure DAO
DODO
Radicle
Goldfinch
Merit Circle
EPNS
Perpetual Protocol
Gitcoin
SuperRare
Indexed
Doodles
Rome DAO
Badger
Paraswap
Unlock
Terra
Shapeshift
Lobis
Pool Together
The Graph
Yearn Finance
Ampleforth
Alpaca Finance
Balancer
Gro Protocol
Sismo DAO
BeethovenX
ENS
Lido
Alchemist

EDIT:

TO get all values you may have to scroll page - and JavaScript will add new items.

There are answers which use while-loop with execute_script() which use JavaScript code to scroll to the bottom and get current height. If height is different than before scroll then you have to scroll again, but if height is the same then you have end of page and now you can get all items.

CodePudding user response：

***Solution***

With selenium4 as the key executable_path is deprecated you have to use an instance of the Service() class along with ChromeDriverManager().install() command as discussed below

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.google.com")

I hope this will resolve your issue