Selenium Couldn't fetch all div tag

Time:03-28

I tried to fetch all the div tags with class "someClass" from a website.

The website needs to be scrolled down to load new div elements, so I used Keys.PAGE_DOWN. That scrolled the page, but the data was still incomplete.

So I used:

elem = driver.find_element(By.TAG_NAME, "body")


no_of_pagedowns = 23

while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.7)
    no_of_pagedowns-=1

It scrolls until the entire HTML page has loaded, but when I write the data to a file, only 20 div tags are written instead of hundreds ...

Complete Code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = 'https://divar.ir/s/tehran/buy-apartment/parand?price=200000000-450000000&non-negotiable=true&has-photo=true&q=خانه پرند'
driver.get(url)


elem = driver.find_element(By.TAG_NAME, "body")


no_of_pagedowns = 23

while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.3)
    no_of_pagedowns-=1

datas = driver.find_elements(By.CLASS_NAME, 'kt-post-card__body')

with open('data.txt', 'w') as f:
    for counter, data in enumerate(datas, start=1):
        f.write(f'{counter}--> {data.text}\n')

driver.quit()
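A fixed number of PAGE_DOWN presses may stop before the page has finished lazy-loading. One common alternative (a sketch, not part of the original code) is to scroll to the bottom repeatedly and stop only when the page height stops growing. The loop logic below is factored into a plain function so it can be driven by any scroll/measure callables; the Selenium wiring shown in the comment is an assumption about how you would hook it up:

```python
import time

def scroll_until_stable(scroll, get_height, pause=1.0, max_rounds=50):
    """Repeatedly scroll until the reported page height stops changing.

    `scroll` performs one scroll step; `get_height` returns the current
    document height. Returns the final height."""
    last = get_height()
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give lazy-loaded content time to render
        new = get_height()
        if new == last:
            return new  # height stable: assume everything has loaded
        last = new
    return last

# With Selenium this could be wired up as (assuming `driver` is on the page):
# scroll_until_stable(
#     lambda: driver.execute_script(
#         "window.scrollTo(0, document.body.scrollHeight);"),
#     lambda: driver.execute_script("return document.body.scrollHeight"),
# )
```

Note that some sites virtualize long lists (removing off-screen items from the DOM), in which case even full scrolling will not expose every card at once and the API approach in the other answer is more reliable.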

CodePudding user response:

To select only 20 <div> tags instead of hundreds you can use list slicing with either of the following locator strategies:

  • Using CSS_SELECTOR

    elements = driver.find_elements(By.CSS_SELECTOR, "div.kt-post-card__body")[:20]
    
  • Using XPATH:

    elements = driver.find_elements(By.XPATH, "//div[@class='kt-post-card__body']")[:20]
    

Ideally you should induce WebDriverWait for visibility_of_all_elements_located() — this requires `from selenium.webdriver.support.ui import WebDriverWait` and `from selenium.webdriver.support import expected_conditions as EC` — and you can use either of the following locator strategies:

  • Using CSS_SELECTOR

    elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.kt-post-card__body")))[:20]
    
  • Using XPATH:

    elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='kt-post-card__body']")))[:20]
    

Update

To select all the <div>s, drop the slice and use either of the following locator strategies:

  • Using CSS_SELECTOR

    elements = driver.find_elements(By.CSS_SELECTOR, "div.kt-post-card__body")
    
  • Using XPATH:

    elements = driver.find_elements(By.XPATH, "//div[@class='kt-post-card__body']")
    

CodePudding user response:

I checked the site and, as far as I can tell, it fetches its data as JSON through an API that uses a cursor. The cursor is a timestamp held in a variable called last-post-date; when you first open the site, this value is provided as lastPostDate inside a JSON blob. To fetch the data quickly, take the lastPostDate value from this page: https://divar.ir/s/tehran/buy-apartment/parand?price=200000000-450000000&non-negotiable=true&has-photo=true&q=خانه پرند and use it to update the last-post-date value in the JSON below.

{"json_schema":{"category":{"value":"apartment-sell"},"districts":{"vacancies":["427"]},"price":{"max":450000000,"min":200000000},"non-negotiable":true,"has-photo":true,"query":"خانه پرند"},"last-post-date":1647005920188580}

This updated JSON should be sent as a POST request to the API endpoint below: https://api.divar.ir/v8/search/1/apartment-sell

A new JSON is returned that contains a "last_post_date" field; subsequent queries can be made using this value as the next cursor. The required listing data is also stored in this JSON.

This is just an idea, but it seemed to work when I tested it with Postman.
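The cursor loop described above can be sketched as follows. The payload shape and endpoint are copied from this answer; the `build_payload` helper name is hypothetical, and the site's API may have changed since, so treat this as a starting point rather than a working client:

```python
# Endpoint and payload taken verbatim from the answer above.
API_URL = "https://api.divar.ir/v8/search/1/apartment-sell"

def build_payload(last_post_date):
    """Build the search payload with the given last-post-date cursor."""
    return {
        "json_schema": {
            "category": {"value": "apartment-sell"},
            "districts": {"vacancies": ["427"]},
            "price": {"max": 450000000, "min": 200000000},
            "non-negotiable": True,
            "has-photo": True,
            "query": "خانه پرند",
        },
        "last-post-date": last_post_date,
    }

# One round of pagination might look like (requires the `requests` package):
# import requests
# resp = requests.post(API_URL, json=build_payload(1647005920188580)).json()
# next_cursor = resp["last_post_date"]  # feed into the next build_payload() call
```

Each response's last_post_date becomes the cursor for the next request, so you can page through all listings without driving a browser at all.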
