Getting Page Number from Site Using BeautifulSoup/Selenium

I am learning web scraping and trying to get the total number of sale pages located at the bottom of this page ("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx"), but am struggling to do so.

To be precise, I am trying to get the text inside the element: <div data-testid="page-number">1 of 532</div>

I have tried using BeautifulSoup: Pages = pageSoup.find("div", {"data-testid" : "page-number"}).text but with no luck.

I then tried using Selenium, but I am struggling to locate the element there as well. I have tried driver.find_element(By.XPATH('')) with no luck either.

Apologies if these are stupid questions, but I am fairly new to web scraping.

CodePudding user response:

I was able to extract the text with the code below.

You need to scroll down towards the end of the page so that the element is loaded before reading it.

Depending on the session, you may or may not need to click the Accept Cookies button to proceed (a sketch that makes the click optional follows the code below).

# Imports and driver setup required for the explicit wait.

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Assumes a local Chrome driver; any WebDriver works.
driver.get("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx")

wait = WebDriverWait(driver, 30)

# Click on the Accept cookies button.
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'Accept All')]"))).click()

# Scroll down to the footer, then back up a little so the pager is in view.
driver.execute_script("window.scrollBy(0,document.body.scrollHeight)")
driver.execute_script("window.scrollBy(0,-1000);")

page_num = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='page-number']")))
print(page_num.text)  # -> 1 of 532
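
Since the cookie banner may not appear on every visit, one way to keep the script robust is to wrap the consent click in a short try/except wait instead of assuming the button is always there; a minimal sketch, reusing the locator above (the 5-second timeout is an arbitrary choice):

from selenium.common.exceptions import TimeoutException

# Optional cookie-consent click: proceed whether or not the banner is shown.
try:
    WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(),'Accept All')]"))
    ).click()
except TimeoutException:
    pass  # Banner did not appear within 5 seconds; carry on without it.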

CodePudding user response:

If your page_source is correct, you can get what you need with the following CSS selector:

soup.select_one('div[data-testid="page-number"]').text

Example

from bs4 import BeautifulSoup

html = '''<div data-testid="page-number">1 of 532</div>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('div[data-testid="page-number"]').text)  # 1 of 532
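
Since the pager only renders once the page has been scrolled in a real browser (as the other answers note), the usual way to get a correct page_source is to take it from a Selenium session and hand it to BeautifulSoup; a minimal sketch, assuming driver is a WebDriver that has already loaded and scrolled the sale page:

from bs4 import BeautifulSoup

# `driver` is assumed to be a Selenium WebDriver that has already rendered the pager.
soup = BeautifulSoup(driver.page_source, 'lxml')
pager = soup.select_one('div[data-testid="page-number"]')
if pager is not None:
    print(pager.text)  # e.g. 1 of 532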

CodePudding user response:

An ideal approach would be to keep scrolling until you find the element with the text 1 of 532, inducing WebDriverWait for visibility_of_element_located(). You can use the following Locator Strategy:

  • Using CSS_SELECTOR:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Chrome()  # Assumes a local Chrome driver; any WebDriver works.
    driver.get("https://www.farfetch.com/uk/shopping/women/sale/all/items.aspx")

    # Dismiss the cookie banner first.
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-testid='Button_PrivacySettingsBanner_AcceptAll']"))).click()

    # Keep scrolling in steps until the page-number element becomes visible.
    while True:
        try:
            driver.execute_script("window.scrollBy(0,1500)")
            print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-testid='page-number']"))).text)
            break
        except TimeoutException:
            continue

    driver.quit()

  • Console Output:

    1 of 532
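
Once the text is available, the total page count the question asks for can be pulled out of the "1 of 532" string; a minimal sketch, assuming the pager keeps that "X of N" format (page_text stands in for the .text value returned above):

import re

page_text = "1 of 532"  # e.g. the .text value printed above

# "1 of 532" -> 532; falls back to None if the format ever changes.
match = re.search(r"of\s+(\d+)", page_text)
total_pages = int(match.group(1)) if match else None
print(total_pages)  # 532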
    