I am trying to scrape product titles from https://www.ternbicycles.com/us/bikes#
I tried both XPath and CSS selectors, but find_elements() only returns an empty list. I didn't find any framesets on this page, I used execute_script to scroll the window down to the bottom, and I did set up a wait time.
Still no luck.
Could someone help me out? Maybe try this URL and give me some ideas. Thank you!
URL = 'https://www.ternbicycles.com/us/bikes'
driver.implicitly_wait(10)
driver.get(URL)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print(driver.title)
titles = driver.find_elements(By.CSS_SELECTOR, "//a[normalize-space()='Quick Haul P9 Performance']")
print(titles)
# result: Just a moment... []
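A note on the snippet above: the printed title "Just a moment..." suggests the site served a bot-check interstitial rather than the real page, and the locator string is an XPath expression passed with By.CSS_SELECTOR, so it could never match even on the real page. A minimal sketch of the matching call, assuming the link text is exact:
# the string is XPath, so it must be paired with By.XPATH, not By.CSS_SELECTOR
titles = driver.find_elements(By.XPATH, "//a[normalize-space()='Quick Haul P9 Performance']")
print(titles)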
#############################################################
Update at 14:05 08/26/2022: Thank you guys for your replies.
I am using Colab and I got TimeoutException: Message: Stacktrace... I think it is a configuration problem? Do you have any solution?
Here is what I imported before I ran your code:
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
# set up Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('window-size=1280,720')
# define the web driver as a Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)
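For reference, newer Selenium 4 releases drop the positional driver-path argument in favor of a Service object; a minimal sketch of the equivalent setup, assuming the chromedriver copied to /usr/bin above:
from selenium.webdriver.chrome.service import Service
# Selenium 4 style: pass the chromedriver path through a Service object
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=options)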
CodePudding user response:
The webpage isn't dynamic, so you can parse all the required content/data with the help of the requests and bs4 modules.
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}
r = requests.get('https://www.ternbicycles.com/us/bikes', headers=headers)
print(r)
soup = BeautifulSoup(r.text, 'html.parser')
for card in soup.select('div[class="title"]'):
    title = card.select_one('h2 > a').get_text(strip=True)
    print(title)
Output:
Quick Haul P9 Performance
Quick Haul D8
NBD S5i
NBD P8i
Short Haul D8
Vektron S10
HSD S11
HSD S8i
HSD P9
Verge X11
Verge S8i
Verge D9
Node D8
Node D7i
BYB P8
Vektron Q9
HSD S
HSD P9 Performance
GSD R14
GSD S00 LX
GSD S10 LX
GSD S10
Verge P10
Eclipse X22
Eclipse P20
Eclipse D16
Link D7i
Link D8
Link C8
Link A7
BYB S11
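Since pandas is imported above but not otherwise used, the titles could also be collected into a DataFrame rather than printed one by one; a minimal sketch of that variation:
titles = [card.select_one('h2 > a').get_text(strip=True) for card in soup.select('div[class="title"]')]
df = pd.DataFrame({'title': titles})  # one row per bike model
print(df)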
CodePudding user response:
The following Selenium-based solution works (tested):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://www.ternbicycles.com/us/bikes'
browser.get(url)
try:
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".agree-button.eu-cookie-compliance-agree-button"))).click()
print('accepted cookies')
except Exception as e:
print('no cookies for you')
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
bikes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h2[class='font-header-bold pb-4']")))
for b in bikes:
print(b.text)
This will print out:
accepted cookies
Quick Haul P9 Performance
Quick Haul D8
NBD S5i
NBD P8i
Short Haul D8
Vektron S10
HSD S11
HSD S8i
HSD P9
Verge X11
Verge S8i
Verge D9
Node D8
Node D7i
BYB P8
Vektron Q9
HSD S
HSD P9 Performance
GSD R14
GSD S00 LX
GSD S10 LX
GSD S10
Verge P10
Eclipse X22
Eclipse P20
Eclipse D16
[...]
The Selenium setup here is for Python; you can adapt the code to your own setup. Just note the imports and the code after the browser (driver) is defined. Selenium docs: https://www.selenium.dev/documentation/
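For the Colab environment from the question's update, the same approach should work with the headless flags from the question and the apt-installed chromedriver path; a sketch of that adaptation (untested on Colab):
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('window-size=1280,720')
# Colab copies chromedriver to /usr/bin (see the question's update)
webdriver_service = Service('/usr/bin/chromedriver')
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)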
CodePudding user response:
To extract the product titles you have to induce WebDriverWait for visibility_of_all_elements_located(), and, using a list comprehension, you can use either of the following locator strategies:
Using CSS_SELECTOR and the text attribute:
driver.get('https://www.ternbicycles.com/us/bikes#')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.agree-button.eu-cookie-compliance-agree-button"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.title>h2>a")))])
Using XPATH and get_attribute("innerHTML"):
driver.get('https://www.ternbicycles.com/us/bikes#')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@class='agree-button eu-cookie-compliance-agree-button']"))).click()
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='title']/h2/a")))])
Console Output:
['Quick Haul P9 Performance', 'Quick Haul D8', 'NBD S5i', 'NBD P8i', 'Short Haul D8', 'Vektron S10', 'HSD S11', 'HSD S8i', 'HSD P9', 'Verge X11', 'Verge S8i', 'Verge D9', 'Node D8', 'Node D7i', 'BYB P8', 'Vektron Q9', 'HSD S ', 'HSD P9 Performance', 'GSD R14', 'GSD S00 LX', 'GSD S10 LX', 'GSD S10', 'Verge P10', 'Eclipse X22', 'Eclipse P20', 'Eclipse D16', 'Link D7i', 'Link D8', 'Link C8', 'Link A7', 'BYB S11']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
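One caveat for both snippets: if the cookie banner does not appear (for example, in a session that has already accepted it), the element_to_be_clickable wait itself raises TimeoutException. Wrapping the click in try/except, as the previous answer does, keeps the script going; a sketch:
from selenium.common.exceptions import TimeoutException
try:
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.agree-button.eu-cookie-compliance-agree-button"))).click()
except TimeoutException:
    pass  # no cookie banner this time; continue with the scrape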