I am learning web scraping and trying to get the total number of sale pages located at the bottom of this page ("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx"), but am struggling to do so.
To be precise, I am trying to get the text inside the element: <div data-testid="page-number">1 of 532</div>
I have tried using BeautifulSoup:
Pages = pageSoup.find("div", {"data-testid" : "page-number"}).text
but with no luck.
I then tried Selenium, but I am struggling to locate the element there as well. I have tried driver.find_element(By.XPATH, ''), but with no luck either.
Apologies if these are stupid questions but I am fairly new to web scraping.
CodePudding user response:
I was able to extract the text with the code below. You need to scroll down towards the end of the page so the details can be extracted, and you may or may not need to click the Accept Cookies button to proceed.
from selenium import webdriver
# Imports required for Explicit wait.
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome; use your preferred browser driver
driver.get("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx")
wait = WebDriverWait(driver, 30)

# Click on Accept cookies
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'Accept All')]"))).click()

# Scroll down to the footer, then back up a little so the pagination is in view
driver.execute_script("window.scrollBy(0,document.body.scrollHeight)")
driver.execute_script("window.scrollBy(0,-1000);")

page_num = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='page-number']")))
print(page_num.text)
1 of 532
CodePudding user response:
So if your page_source is correct, you can get your goal with the following CSS selector:
soup.select_one('div[data-testid="page-number"]').text
Example
from bs4 import BeautifulSoup
html='''<div data-testid="page-number">1 of 532</div>'''
soup = BeautifulSoup(html, 'lxml')
soup.select_one('div[data-testid="page-number"]').text
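If requests alone does not return that element (parts of the page are rendered with JavaScript), one option is to take page_source from a Selenium-driven browser and hand it to BeautifulSoup. A minimal sketch of that combination is below; the Chrome driver setup is an assumption, and the element may still be absent until the footer has rendered:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumption: any working Selenium driver will do
driver.get("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx")

# Parse the rendered HTML rather than the raw response from requests
soup = BeautifulSoup(driver.page_source, 'lxml')
page_number = soup.select_one('div[data-testid="page-number"]')
print(page_number.text if page_number else "page-number element not found")

driver.quit()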
CodePudding user response:
An ideal approach would be to keep scrolling until you find the element with the text 1 of 532, inducing WebDriverWait for visibility_of_element_located(). You can use the following locator strategy:
Using CSS_SELECTOR:
driver.get("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx") WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-testid='Button_PrivacySettingsBanner_AcceptAll']"))).click() while True: try: driver.execute_script("window.scrollBy(0,1500)") print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-testid='page-number']"))).text) break except TimeoutException: continue
driver.quit()
Console Output:
1 of 532
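Since the question asks for the total number of sale pages rather than the raw label, the text can be split once it has been retrieved. A small sketch, assuming the label always has the form "N of M":
page_label = "1 of 532"               # the text retrieved by either approach above
total_pages = int(page_label.split(" of ")[-1])
print(total_pages)                    # 532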