What is the most efficient way in Selenium to open and process links within a loop?


With Selenium and BeautifulSoup I am trying to scrape a web page. In general this works fine; please find the code below.

This page lists some categories. The tree is 4 levels deep, and on each level there are 20 items/links.

My question is: what is the most efficient way to open and process these links within a loop?

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=options)

wd.get("url")

source = wd.page_source
soup = BeautifulSoup(source, "html.parser")
items = soup.select('ul[data-card-id="tree-list0972"]')
for item in items:
  ul = item.find('ul')
  for li in ul:
    print(li.a.get('href') + ',' + li.a.text)
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    cats = webdriver.Chrome('chromedriver', options=options)

    # Here I need to open the link from the URL list (3 levels deep)
    cats.get(h + domain + li.a.get('href'))

    WebDriverWait(cats, timeout=3)
    cats.close()
wd.close()

CodePudding user response:

I would probably try to implement your use case without BeautifulSoup, in a structure like this:

1. create web driver

wd = webdriver.Chrome('chromedriver',options=options)

2. open the "main" web page

wd.get("url")

3. get all link elements

elements = wd.find_elements_by_css_selector('ul[data-card-id="..."] li a')
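(With Selenium 4, the equivalent call is wd.find_elements(By.CSS_SELECTOR, ...), with By imported from selenium.webdriver.common.by.)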

4. get the URL of each element

pages = []
for element in elements:
   pages.append(element.get_attribute('href'))
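Collecting the URLs into a list first matters here: once the driver navigates to another page, the element references found earlier become stale, and accessing them would raise a StaleElementReferenceException.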

5. process each page

for page in pages:
   wd.get(page)
   # ...
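Putting the steps together, here is a minimal sketch of the whole loop. It assumes the selector from your question ('ul[data-card-id="tree-list0972"] li a') matches the category links and that "url" stands in for your start page; the crawl function and the depth of 4 are just illustrations of descending the tree with a single reused driver:

import sys
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# one driver instance, reused for every page
wd = webdriver.Chrome('chromedriver', options=options)

def collect_links(url):
    # open a page and return the (href, text) pairs of its category links
    wd.get(url)
    anchors = wd.find_elements_by_css_selector('ul[data-card-id="tree-list0972"] li a')
    # read href/text immediately; the references go stale after the next wd.get()
    return [(a.get_attribute('href'), a.text) for a in anchors]

def crawl(url, depth):
    # walk the category tree, one level per recursion step
    if depth == 0:
        return
    for href, text in collect_links(url):
        print(href + ',' + text)
        crawl(href, depth - 1)

crawl("url", 4)   # 4 levels deep, as described in the question
wd.quit()

Opening and closing a new Chrome instance per link is by far the most expensive part of your current code; reusing one driver as above is usually the biggest efficiency win.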