I use the Python package Selenium to click the "load more" button automatically, which works. But why can't I get the data that appears after "load more"?
I want to crawl reviews from HP.com using Python. The page displays only a few monitors until I click the "load more" button. I use the Python package Selenium to click that button automatically, and the clicking works. But why can't I get the data that loads afterwards? I just get the first page's data repeatedly.
I also tried "wait.until(EC.invisibility_of_element_located)", but it never ran no matter what I tried, so I fell back to hardcoded sleeps of a few seconds. If anyone can show me how to rewrite this, I would really appreciate it.
from selenium import webdriver
import time

url = "https://www.hp.com/us-en/shop/plp/accessories/computer-monitors"
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"}

# Raw string so the backslashes in the Windows path are not treated as escapes
browser = webdriver.Chrome(r"C:\Python310\chromedriver.exe")
browser.maximize_window()
browser.get("https://www.hp.com/us-en/shop/plp/accessories/computer-monitors")
loadmore = browser.find_element_by_css_selector('#content > div.clearfix.vwa > div.product-results.product-results.left-menu-open > div.search-results > span')

# Scroll to the bottom and click "load more" five times
count = 0
while count < 5:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    loadmore.click()
    time.sleep(3)
    count += 1

import requests
import re
from bs4 import BeautifulSoup

# Fetch the page again with requests and parse the product titles
res = requests.get(url, headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
prods = soup.find_all("a", attrs={"class":"product-title pdp-link"})
for prod in prods:
    print(prod.get_text())
CodePudding user response:
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"}
options = Options()
# Pass only the User-Agent string, not the whole dict
options.add_argument(f"user-agent={headers['User-Agent']}")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# Service(r"C:\Python310\chromedriver.exe") also works here
driver.maximize_window()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hp.com/us-en/shop/plp/accessories/computer-monitors")
# Close the pop-up that covers the page before clicking anything else
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.caclose"))).click()
count = 0
while count < 5:
    try:
        # Wait until "load more" is clickable, then click it
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.search-load-more"))).click()
        count += 1
    except Exception as e:
        print(str(e))
        break
# Read the titles from the live page once all products are visible
texts = [x.text for x in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.product-title.pdp-link")))]
print(texts)
An element pops up on my region's version of the site, so I close it first. You can click "load more" 5 times as above, or use while True to keep clicking until the button disappears. Then, to grab all the text, just wait for the elements to be visible and use .text.
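The "keep clicking until it disappears" variant could be sketched roughly like this. It is only a sketch: click_until_gone, load_more_getter and the max_clicks safety cap are names and choices I made up for illustration, and the "span.search-load-more" selector is simply the one used above, so it may need adjusting if the page changes.

```python
def click_until_gone(get_button, max_clicks=50):
    """Click whatever get_button() returns until it returns None.

    max_clicks is a safety cap (an assumption, not part of the answer above)
    so that a button which never disappears cannot loop forever.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = get_button()
        if button is None:  # no clickable button left: stop
            break
        button.click()
        clicks += 1
    return clicks


def load_more_getter(driver, css="span.search-load-more", timeout=10):
    """Build a get_button() callable backed by a Selenium explicit wait."""
    # Imported here so click_until_gone() stays usable without a browser.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)

    def get_button():
        try:
            return wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css)))
        except TimeoutException:
            return None  # button did not become clickable within the timeout

    return get_button
```

With a driver set up as in the code above, this would be called as click_until_gone(load_more_getter(driver)). Catching only TimeoutException, instead of every Exception, means unrelated errors still surface.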
Imports:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
Outputs:
['HP M24FW FHD MONITOR', 'HP M27F FHD MONITOR', 'HP M24F FHD MONITOR', 'HP V241IB FHD MONITOR', 'HP 27F 27-INCH 4K DISPLAY', 'HP V24 FHD MONITOR', 'HP P22H G4 FHD MONITOR', 'HP U28 4K HDR MONITOR', 'HP Z24U G3 WUXGA USB-C DISPLAY', 'HP V20 HD MONITOR', 'HP E27 G4 FHD MONITOR', 'HP E24Q G4 QHD MONITOR', 'HP 27MQ 27-INCH MONITOR', 'HP E27U G4 QHD USB-C MONITOR', 'HP U27 4K WIRELESS MONITOR', 'HP M27FD FHD MONITOR', 'HP X34 WQHD GAMING MONITOR', 'HP X27C FHD GAMING MONITOR', 'HP X32C FHD GAMING MONITOR']
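As for why the script in the question kept printing only the first page: requests.get(url) sends a separate HTTP request that knows nothing about the clicks made in the Selenium browser, so it always receives the initial HTML. If you would rather parse HTML than read .text from elements, one option is to parse the page Selenium already holds (driver.page_source, which in Chrome generally reflects the current DOM). Below is a minimal sketch using only the standard library's html.parser so that it runs without extra packages. TitleParser and extract_titles are hypothetical names of mine. With BeautifulSoup the equivalent change would be passing driver.page_source to BeautifulSoup instead of res.text.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of <a> tags that carry both target CSS classes."""

    TARGET_CLASSES = {"product-title", "pdp-link"}

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            classes = (dict(attrs).get("class") or "").split()
            if self.TARGET_CLASSES <= set(classes):
                self._in_title = True
                self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False


def extract_titles(html):
    """Return the product titles found in an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

After the "load more" clicks have finished, this would be used as extract_titles(driver.page_source).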
CodePudding user response:
When you access this site with a fresh browser, a dialog asking you to accept their privacy policy appears after a delay. So the solution is to wait for the accept button and click it before trying to click "load more".
wait = WebDriverWait(driver, 10)  # driver is your WebDriver instance (self.driver inside a class)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#onetrust-accept-btn-handler"))).click()
For this code to work you need to add the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC