Selenium Scraping Script to Beautiful Soup

Hi everyone. The script below uses Selenium, but it's extremely slow and not feasible for a large number of URLs. Can anyone tell me how to convert it into a fast bs4 script, and can Beautiful Soup scrape "Click to show" buttons? Thank you everyone for helping me!

from selenium import webdriver
import time
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
chrome_path = r"C:\Users\lenovo\Downloads\chromedriver_win32 (5)\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

driver.maximize_window()
driver.implicitly_wait(10)

driver.get("https://www.autotrader.ca/a/ram/1500/hamilton/ontario/19_12052335_/?showcpo=ShowCpo&ncse=no&ursrc=pl&urp=2&urm=8&sprx=-2")
wait = WebDriverWait(driver, 30)

# click the close button first
driver.find_element(By.XPATH, '//button[@class="close-button"]').click()
# wait for the "Click to show" link, scroll it into view, then click it
option = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[text()= 'Click to show']")))
driver.execute_script("arguments[0].scrollIntoView(true);", option)
option.click()
time.sleep(10)

# find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4.3
name = driver.find_element(By.XPATH, '//p[@class="hero-title"]')
number = driver.find_element(By.XPATH, '//div[@class="card-body"]')
print(name.text, number.text)

CodePudding user response:

You don't really need Selenium here; you can simply use requests, because the phone number you're looking for is already in the HTML (just not visible).

If you click "View page source" in your browser, you can press Ctrl+F and search for the phone number.

So you don't need to emulate a browser or click any buttons: everything is already there!
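The same check can be made from Python instead of by hand in the browser: search the raw HTML text for a value you expect to see. This is only a minimal sketch. `find_in_source` is a hypothetical helper name, and `sample_html` is a made-up stand-in for the page source, not the real page.

```python
def find_in_source(html, needle, context=30):
    """Return short snippets of `html` around each occurrence of `needle`."""
    if not needle:
        return []
    snippets = []
    start = html.find(needle)
    while start != -1:
        lo = max(0, start - context)
        hi = min(len(html), start + len(needle) + context)
        snippets.append(html[lo:hi])
        # continue searching after the current match
        start = html.find(needle, start + len(needle))
    return snippets


# Made-up stand-in for the page source; a real page would come from an HTTP response.
sample_html = '<script>{"phoneNumber":"905-870-7127","city":"Hamilton"}</script>'

hits = find_in_source(sample_html, "905-870-7127")
print(bool(hits))  # True means the value is present in the static HTML
for hit in hits:
    print(hit)
```

If the value shows up here, a plain HTTP client is probably enough. If it does not, the page likely loads it with JavaScript after a click, and neither requests nor Beautiful Soup will see it without extra work.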

Now let's see how we can scrape this data using just requests (or any other HTTP client, such as httpx or aiohttp):

import requests
import re

url = "https://www.autotrader.ca/a/ram/1500/hamilton/ontario/19_12052335_/?showcpo=ShowCpo&ncse=no&ursrc=pl&urp=2&urm=8&sprx=-2"
# pretend the request comes from a web browser by sending a browser User-Agent header;
# this gets around basic anti-bot protection
# here we use a Windows Chrome User-Agent string (you can find these online)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

# request the HTML page
response = requests.get(url, headers=headers)

# now we can use regex patterns to find the phone number and the description
phone_number = re.findall(r'"phoneNumber":"([\d-]+)"', response.text)
# ["905-870-7127"]
description = re.findall(r'"description":"(.+?)"', response.text)
# ['2011 Ram 1500 Sport Crew Cab v8 5.7L - Fully loaded, Crew cab, leather heated/air-conditioned seats, heated leather steering wheel, 5’7 ft box w/ tonneau cover.']

Regex patterns take a bit of work to wrap your head around at first. I suggest googling "regex python tutorial" if you want to learn more, but I can explain the pattern we're using here: we capture everything inside the double quotes that follow the "phoneNumber":" string, as long as each character is either a digit (written \d) or a dash (written simply as -).
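To see how the two patterns behave without hitting the site, you can run them against a small string. The `sample` below is a made-up fragment standing in for `response.text`; the real page source may be laid out differently.

```python
import re

# Made-up fragment standing in for response.text; the real page may differ.
sample = '{"phoneNumber":"905-870-7127","description":"2011 Ram 1500 Sport Crew Cab"}'

# [\d-]+ : one or more characters, each of which is a digit or a dash
phone_pattern = re.compile(r'"phoneNumber":"([\d-]+)"')
# .+?    : one or more of any character, as few as possible,
#          so the match stops at the next double quote
description_pattern = re.compile(r'"description":"(.+?)"')

phones = phone_pattern.findall(sample)
descriptions = description_pattern.findall(sample)

print(phones)        # ['905-870-7127']
print(descriptions)  # ['2011 Ram 1500 Sport Crew Cab']
```

One caveat: because the description pattern stops at the first double quote, a description that itself contains an escaped quote would be cut short. If the data turns out to be embedded JSON, decoding it with the standard `json` module would likely be more robust than regex.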

This requests script takes only a few seconds to complete and uses almost no computing resources. However, one thing to watch out for when using an HTTP client instead of Selenium browser emulation is bot blocking, which often takes quite a bit of development work to get around, though the performance gains are really worth it!
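For a large list of URLs, the same idea can be wrapped in a loop. The sketch below is one possible shape, not a tested scraper for this site: `fetch_with_requests` and `scrape_phone_numbers` are hypothetical names, the one-second delay and the shortened User-Agent string are arbitrary assumptions, and any non-200 response is simply treated as "blocked or failed".

```python
import re
import time

PHONE_PATTERN = re.compile(r'"phoneNumber":"([\d-]+)"')

# Browser-like User-Agent; the exact string is only an example.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_with_requests(url):
    """Fetch one page; return its HTML, or None if the request fails or is refused."""
    import requests  # imported lazily so the parsing code can run without it

    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        # e.g. 403 or 429 when the site decides we look like a bot
        return None
    return response.text


def scrape_phone_numbers(urls, fetch=fetch_with_requests, delay=1.0):
    """Map each URL to the first phone number found in its HTML, or None."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests to stay polite
        html = fetch(url)
        match = PHONE_PATTERN.search(html) if html else None
        results[url] = match.group(1) if match else None
    return results


# Usage with a stand-in fetcher, so nothing is downloaded:
fake_pages = {
    "https://example.com/listing-1": '{"phoneNumber":"905-870-7127"}',
    "https://example.com/listing-2": None,  # simulates a blocked request
}
print(scrape_phone_numbers(list(fake_pages), fetch=fake_pages.get, delay=0))
```

Passing the fetcher in as a parameter keeps the parsing logic testable offline. When fetching many pages from the same host, a `requests.Session` would also reuse connections, which should speed things up further.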

CodePudding user response:

Since you're using Selenium, how much performance can you expect? Maybe consider using requests instead.