Python web scraping: pricing and hidden elements with Selenium, part 2

Time:11-08

This is a continuation of my previous question. With Prophet's help I got the scraper working, until last week.

Basically, each listing on this website has an MLS number, and I am trying to scrape it.

The main change since the original question is new logic to handle a random popup: a try block attempts to click the popup; if that succeeds, the script proceeds to scrape, and if the popup is not found, it proceeds to scrape anyway. That try block logic works. But now when I run the code, the price and MLS scraping returns empty results.
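
The "click the popup if it is there, otherwise carry on" logic described above can also be written without a try block. This is a minimal sketch, not the code from main.py: `click_if_present` is a hypothetical helper name. It relies on Selenium's `find_elements` (plural) returning an empty list when nothing matches, instead of raising like `find_element` does.

```python
def click_if_present(driver, by, value):
    """Click the first element matching (by, value), if any.

    Returns True when something was clicked and False when nothing matched.
    `driver` is anything exposing Selenium's find_elements(by, value) method.
    """
    # find_elements returns an empty list when nothing matches,
    # so a missing popup needs no exception handling.
    matches = driver.find_elements(by, value)
    if not matches:
        return False
    matches[0].click()
    return True
```

With a real driver this might be called as `click_if_present(driver, By.CSS_SELECTOR, "button.popup-close")` right after `driver.get(url)`. The selector is a placeholder for whatever the popup's close button actually is.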

I added a print statement and confirmed that both mls and price are empty. It prints "mls is" and "price is" with nothing after either.
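
One thing worth checking when `.text` comes back empty: Selenium's `WebElement.text` returns only text that is rendered as visible, so an element that exists in the DOM but is hidden yields an empty string. Whether that is what changed on this site is an assumption, not something confirmed here. A minimal sketch of a fallback that reads the `textContent` property, which includes hidden text (`text_or_text_content` is a hypothetical helper name):

```python
def text_or_text_content(element):
    """Return the element's visible text, or its raw textContent if that is empty."""
    visible = (element.text or "").strip()
    if visible:
        return visible
    # get_attribute("textContent") also returns the text of hidden nodes.
    raw = element.get_attribute("textContent")
    return (raw or "").strip()
```

If the fallback returns a value where `.text` did not, the element is present but not displayed, and the selector itself is still matching.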

main.py is the version with the try blocks: https://github.com/jzoudavy/webScrap_Selenium/blob/main/main.py

To simplify, I removed the try block and other extras in test.py: https://github.com/jzoudavy/webScrap_Selenium/blob/main/test.py. Running test.py confirms that I am no longer picking up price or MLS, while everything else still scrapes fine.

CodePudding user response:

You can try the following example:

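from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
# Hide the automation flag that some sites use to detect Selenium.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("start-maximized")
# Keep the browser window open after the script finishes.
options.add_experimental_option("detach", True)

s = Service('./chromedriver')
driver = webdriver.Chrome(service=s, options=options)
url = 'https://www.centris.ca/en/properties~for-sale~brossard?view=Thumbnail'
driver.get(url)
time.sleep(4)  # crude fixed wait for the listings to render

# Each listing card keeps its details inside an element with class "description".
listings = driver.find_elements(By.CLASS_NAME, 'description')

for listing in listings:
    try:
        # The price text sits in a span that follows the itemprop="price" element.
        price = listing.find_element(By.XPATH, './/*[@itemprop="price"]//following-sibling::span[1]').text
        # The attribute test in this XPath was lost in the source ('.//*[@]');
        # matching on the attribute that is read afterwards is an assumption.
        data_mlsnumber = listing.find_element(By.XPATH, './/*[@data-mlsnumber]').get_attribute('data-mlsnumber')
        print(price, data_mlsnumber)
    except NoSuchElementException:
        # Skip cards that lack a price or an MLS number.
        continue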

Output:

$379,900 26194286
$529,900 23424373
$1,198,000 11994635
$345,900 23572465
$769,000 18521757
$574,000 28083515
$445,000 16204179
$329,000 14472385
$499,000 21331679
$515,000 13217312
$445,000 25003504
$799,000 9396792
$848,000 16371416
$598,000 10627269
$439,000 23978353
$419,900 16298302
$750,000 15553712
$429,000 24377480
$599,000 25292367
$1,790,000 21076634

CodePudding user response:

Solution

You can select the ad links and read the data-mlsnumber value from each one with get_attribute():

# Assumes `driver` is already on the listings page and `By` is imported.
ads = driver.find_elements(By.XPATH, "//div[@class='description']/a")
prices = driver.find_elements(By.XPATH, "//div[@class='price']/span[text()][not(@class='desc')]")
# Pair each ad link with the price at the same position.
for ad, price in zip(ads, prices):
    print(f'mls: {ad.get_attribute("data-mlsnumber")}, price: {price.text.strip()}')
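
Pairing two separately collected lists by position assumes every ad has exactly one matching price. Here is a minimal sketch that makes that assumption explicit and fails loudly when the counts differ. `pair_mls_with_prices` is a hypothetical helper, and it expects the `ads` and `prices` lists from the snippet above.

```python
def pair_mls_with_prices(ads, prices):
    """Return (mls_number, price_text) tuples, one per ad.

    Raises ValueError if the two lists differ in length, since positional
    pairing would then attach prices to the wrong listings.
    """
    if len(ads) != len(prices):
        raise ValueError(
            f"found {len(ads)} ads but {len(prices)} prices; cannot pair by position"
        )
    return [
        (ad.get_attribute("data-mlsnumber"), price.text.strip())
        for ad, price in zip(ads, prices)
    ]
```

If the counts ever disagree, looking up the price inside each listing element, as the first answer does, avoids the problem.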