I am new to Selenium and am trying to scrape:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/
I need all the details listed on this page and on the other pages as well.
There are more pages containing the same kind of information, and I need to scrape them too. I tried to reach them by changing the target URL:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/40
but the last segment of the URL changes and does not match the page number. Page 3 has 40 at the end, and page 5 has:
https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/80
so I am not able to get the data that way.
Here is my code:
def extract_url():
    url = driver.find_elements(By.XPATH,"//h2[@class='resultTitle']//a")
    for i in url:
        dist.append(i.get_attribute("href"))
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    driver.find_element(By.XPATH,"//li[@class='btnNextPre']//a").click()

for _ in range(10):
    extract_url()
It works fine up to page 5 but not after that. Could you please suggest how I can iterate over the pages when the number of pages is not known in advance, and extract data until the last page?
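For reference, the two URLs in the question are consistent with the last path segment being a result offset rather than a page number: page 3 ends in 40 and page 5 ends in 80, which fits 20 results per page. The following is a minimal sketch under that assumption. The names `page_url` and `RESULTS_PER_PAGE` are hypothetical, and the page size is inferred from those two data points only, not confirmed by the site.

```python
BASE_URL = "https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/"

# Assumption: inferred from page 3 -> 40 and page 5 -> 80 in the question.
RESULTS_PER_PAGE = 20


def page_url(page_number: int) -> str:
    """Return the listing URL for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number must be >= 1")
    offset = (page_number - 1) * RESULTS_PER_PAGE
    # Page 1 is the bare listing URL from the question; whether a
    # trailing "0" is also accepted by the site is not known.
    return BASE_URL if offset == 0 else f"{BASE_URL}{offset}"
```

This only helps if the total number of pages is known or can be detected some other way, so it does not by itself answer the "unknown number of pages" part of the question.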
CodePudding user response:
You need to check whether the pagination link is disabled. Use an infinite loop and break out of it once the "next" pagination button is disabled.
Use WebDriverWait() and wait for visibility of the element.
Code:
driver.get("https://www.asklaila.com/search/Delhi-NCR/-/book-distributor/")
counter = 1
while True:
    # Wait until the result titles on the current page are visible
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h2.resultTitle >a")))
    urllist = [item.get_attribute('href') for item in driver.find_elements(By.CSS_SELECTOR, "h2.resultTitle >a")]
    print(urllist)
    print("Page number :" + str(counter))
    # Click the "next" link through JavaScript
    driver.execute_script("arguments[0].click();", driver.find_element(By.CSS_SELECTOR, "ul.pagination >li.btnNextPre>a"))
    # Check whether the pagination button is disabled
    if len(driver.find_elements(By.XPATH, "//li[@class='disabled']//a[text()='>']")) > 0:
        print("pagination not found!!!")
        break
    time.sleep(2)  # To slow down the loop
    counter = counter + 1
Import the libraries below.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
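One thing to watch in a loop of this shape is the order of the steps. If the disabled check runs right after the click and the click has just landed on the last page, the loop may exit before that page's links are read. Whether this actually happens on the site depends on its markup and timing, which has not been verified here. Below is a minimal sketch of the control flow only, with the browser calls passed in as functions so the ordering is easy to see. The names `scrape_until_last_page`, `get_links`, `next_is_disabled`, `go_to_next` and `max_pages` are hypothetical and not part of Selenium.

```python
from typing import Callable, List


def scrape_until_last_page(
    get_links: Callable[[], List[str]],
    next_is_disabled: Callable[[], bool],
    go_to_next: Callable[[], None],
    max_pages: int = 1000,
) -> List[str]:
    """Collect links page by page until the "next" control is disabled."""
    all_links: List[str] = []
    # max_pages is a safety cap so a selector mismatch cannot loop forever.
    for _ in range(max_pages):
        # 1. Read the current page first, so the last page is included.
        all_links.extend(get_links())
        # 2. Stop only after the current page has been read.
        if next_is_disabled():
            break
        # 3. Otherwise move on to the next page.
        go_to_next()
    return all_links
```

With Selenium, `get_links` could wrap the WebDriverWait and find_elements calls from the code above, `next_is_disabled` could wrap the XPath check for the disabled list item, and `go_to_next` could wrap the JavaScript click. Those selectors are taken from the answer as given.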