With Selenium and BeautifulSoup I am trying to scrape a web page. In general this works fine; please find the code below.
The page lists some categories, nested 4 levels deep, with 20 items/links on each level.
My question is: what is the most efficient way to open and process these links within a loop?
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import time
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=options)
wd.get("url")
source = wd.page_source
soup = BeautifulSoup(source, "html.parser")
items = soup.select('ul[data-card-id="tree-list0972"]')
links = []
for item in items:
    ul = item.find('ul')
    for li in ul.find_all('li'):
        print(li.a.get('href'), ',', li.a.text)
        links.append(li.a.get('href'))
# reuse the options from above instead of building a second ChromeOptions
cats = webdriver.Chrome('chromedriver', options=options)
for link in links:
    # Here I need to open each link from the url list (3 levels deep)
    cats.get(domain + link)  # domain = base url of the site, defined elsewhere
    # wait until the page has finished loading
    WebDriverWait(cats, timeout=3).until(
        lambda d: d.execute_script('return document.readyState') == 'complete')
cats.close()
wd.close()
CodePudding user response:
I would probably try to implement your use case without BeautifulSoup, in a structure like this:
1. create web driver
wd = webdriver.Chrome('chromedriver',options=options)
2. open the "main" web page
wd.get("url")
3. get all elements
elements = wd.find_elements_by_css_selector('ul[data-card-id="..."] li a')
4. get the url of each element
pages = []
for element in elements:
    pages.append(element.get_attribute('href'))
5. process each page
for page in pages:
    wd.get(page)
    # ...
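Putting the steps together, a minimal sketch could look like this. It assumes the selector from your question ('tree-list0972') appears on every level and that the tree is exactly 4 levels deep, as you describe; adjust both to your page:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver', options=options)

def collect_links(url):
    # open the page and return the href of every category link on it
    wd.get(url)
    elements = wd.find_elements_by_css_selector(
        'ul[data-card-id="tree-list0972"] li a')
    return [e.get_attribute('href') for e in elements]

# level 1: the 20 category links on the main page
pages = collect_links("url")

# levels 2-4: collect all links of one level before descending,
# so a single driver instance can be reused for the whole tree
for level in range(3):
    next_pages = []
    for page in pages:
        next_pages.extend(collect_links(page))
    pages = next_pages

# pages now holds the links of the deepest level; process them here
for page in pages:
    wd.get(page)
    # ...

wd.quit()
Two things worth noting: get_attribute('href') returns the absolute URL, so you don't need to prepend the domain yourself, and reusing one driver for all pages is considerably cheaper than starting a new Chrome instance per link.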