I'm pretty new to web scraping and would appreciate any advice for the scenarios below:
I'm trying to produce a home loans listing table using data from https://www.canstar.com.au/home-loans/
I'm mainly trying to get listings values like the ones below:
- Homestar Finance | Star Essentials P&I 80% | Variable
- Unloan | Home Loan LVR <80% | Variable
- TicToc Home Loans | Live-in Variable P&I | Variable
- ubank | Neat Home Loan Owner Occupied P&I 70-80% | Variable
and push them into a nested table results = [[Homestar Finance, Star Essentials P&I 80%, Variable], etc, etc]
My first attempt, I've used BeautifulSoup entirely and practice on an offline version of the site.
import pandas as pd
from bs4 import BeautifulSoup
with open('/local/path/canstar.html', 'r') as canstar_offline :
content = canstar_offline.read()
results = [['Affiliate', 'Product Name', 'Product Type']]
soup = BeautifulSoup(content, 'lxml')
for listing in soup.find_all('div', class_='table-cards-container') :
for listing1 in listing.find_all('a') :
if listing1.text.strip() != 'More details' and listing1.text.strip() != '' :
results.append(listing1.text.strip().split(' | '))
df = pd.DataFrame(results[1:], columns=results[0]).to_dict('list')
df2 = pd.DataFrame(df)
print(df2)
I pretty much got very close to what I wanted, but unfortunately it doesn't work for the actual site cause it looks like I'm getting blocked for repeated requests.
So I tried this again on Selenium but now I'm stuck.
I tried using as much of the transferrable filtering logic that I used from BS, but I can't get anywhere close to what I had using Selenium.
import time
from selenium.webdriver.common.by import By
url = 'https://www.canstar.com.au/home-loans'
results = []
driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)
time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
listing = table.find_element(By.TAG_NAME, 'a')
print(listing.text)
This version (above) only returns one listing (I'm trying to get the entire table through iteration)
import time
from selenium.webdriver.common.by import By
url = 'https://www.canstar.com.au/home-loans'
results = []
driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)
time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
# listing = table.find_element(By.TAG_NAME, 'a')
print(table.text)
This version (above) looks like it gets all the text from the 'table-cards-container' class, but I'm unable to filter through it to just get the listings.
CodePudding user response:
I think you can try something like this, I hope the comments in the code explain what it is doing.
# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Initiate the driver and navigate
driver = webdriver.Chrome()
url = 'https://www.canstar.com.au/home-loans'
driver.get(url)
# We save the loans list
loans = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//cnslib-table-card")))
# We make a loop once per loan in the loop
for i in range(1, len(loans)):
# With this Xpath I save the title of the loan
loan_title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//a)[1]"))).text
print(loan_title)
# With this Xpath I save the first percentaje we see for the loan
loan_first_percentaje = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[1]"))).text
print(loan_first_percentaje)
# With this Xpath I save the second percentaje we see for the loan
loan_second_percentaje = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[3]"))).text
print(loan_second_percentaje)
# With this Xpath I save the amount we see for the loan
loan_amount = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[5]"))).text
print(loan_amount)