Python - Need Help Web Scraping Dynamic Website

Time:12-05

I'm pretty new to web scraping and would appreciate any advice for the scenarios below:

I'm trying to produce a home loans listing table using data from https://www.canstar.com.au/home-loans/

I'm mainly trying to get listing values like the ones below:

  • Homestar Finance | Star Essentials P&I 80% | Variable
  • Unloan | Home Loan LVR <80% | Variable
  • TicToc Home Loans | Live-in Variable P&I | Variable
  • ubank | Neat Home Loan Owner Occupied P&I 70-80% | Variable

and push them into a nested list, e.g. results = [['Homestar Finance', 'Star Essentials P&I 80%', 'Variable'], etc, etc]

For my first attempt, I used BeautifulSoup exclusively and practiced on an offline copy of the site.

import pandas as pd
from bs4 import BeautifulSoup

with open('/local/path/canstar.html', 'r') as canstar_offline:
    content = canstar_offline.read()

results = [['Affiliate', 'Product Name', 'Product Type']]

soup = BeautifulSoup(content, 'lxml')

# Each listing link holds a pipe-delimited string; skip empty links and 'More details'
for listing in soup.find_all('div', class_='table-cards-container'):
    for listing1 in listing.find_all('a'):
        text = listing1.text.strip()
        if text and text != 'More details':
            results.append(text.split(' | '))

df = pd.DataFrame(results[1:], columns=results[0])

print(df)

I got very close to what I wanted, but unfortunately it doesn't work against the live site because it looks like I'm getting blocked for repeated requests.
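One thing I've read about for this situation is sending browser-like headers with the request. A sketch below — the header values are guesses on my part, and whether they're actually enough to get past Canstar's blocking is an assumption:

```python
import requests

# Browser-like headers; it's an assumption that these are
# sufficient to avoid the site's rate limiting / blocking
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0 Safari/537.36',
    'Accept-Language': 'en-AU,en;q=0.9',
}

def fetch_page(url):
    # Identify as a browser and fail fast instead of hanging
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

# html = fetch_page('https://www.canstar.com.au/home-loans/')
```

Even with headers, a heavily dynamic site may still need a real browser (hence the Selenium attempts below).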

So I tried this again with Selenium, but now I'm stuck.

I tried to reuse as much of the transferable filtering logic from BeautifulSoup as I could, but I can't get anywhere close to what I had.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'

results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
    listing = table.find_element(By.TAG_NAME, 'a')
    print(listing.text)

This version (above) only returns one listing per container; I'm trying to get the entire table through iteration.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'

results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
#     listing = table.find_element(By.TAG_NAME, 'a')
    print(table.text)

This version (above) does get all the text from the 'table-cards-container' class, but I'm unable to filter it down to just the listings.
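One route I considered (the commented-out lines in my snippets hint at it) is handing driver.page_source back to BeautifulSoup and reusing the offline filtering logic. A sketch, untested against the live site:

```python
from bs4 import BeautifulSoup

def extract_listings(html):
    # Reuse the offline BeautifulSoup filtering on rendered page source.
    # html.parser avoids the lxml dependency; 'lxml' works too.
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for container in soup.find_all('div', class_='table-cards-container'):
        for link in container.find_all('a'):
            text = link.text.strip()
            if text and text != 'More details':
                rows.append(text.split(' | '))
    return rows

# Intended usage with Selenium (not run here):
# from selenium import webdriver
# import time
# driver = webdriver.Chrome()
# driver.get('https://www.canstar.com.au/home-loans')
# time.sleep(3)  # crude wait; an explicit WebDriverWait would be more robust
# print(extract_listings(driver.page_source))
```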

CodePudding user response:

I think you can try something like this, I hope the comments in the code explain what it is doing.

# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initiate the driver and navigate
driver = webdriver.Chrome()
url = 'https://www.canstar.com.au/home-loans'
driver.get(url)

# We save the loans list
loans = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//cnslib-table-card")))

# Loop once per loan; XPath indices are 1-based, so go up to len(loans)
for i in range(1, len(loans) + 1):
    # With this XPath I save the title of the loan
    loan_title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//a)[1]"))).text
    print(loan_title)
    # With this XPath I save the first percentage we see for the loan
    loan_first_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[1]"))).text
    print(loan_first_percentage)
    # With this XPath I save the second percentage we see for the loan
    loan_second_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[3]"))).text
    print(loan_second_percentage)
    # With this XPath I save the amount we see for the loan
    loan_amount = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[5]"))).text
    print(loan_amount)
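To build the nested results list from the question, each loan title can be split on ' | ' inside that loop. A sketch, assuming the titles are pipe-delimited strings like the question's examples:

```python
# Header row plus one [affiliate, product name, product type] row per loan
results = [['Affiliate', 'Product Name', 'Product Type']]

# Example title string in the format the question shows
loan_title = 'Unloan | Home Loan LVR <80% | Variable'

# Split 'Affiliate | Product | Type' into a three-element row
results.append(loan_title.split(' | '))

print(results[1])  # ['Unloan', 'Home Loan LVR <80%', 'Variable']
```

Inside the loop above, that would just be results.append(loan_title.split(' | ')) after each title is read.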