My previously working scraping code (Python/Selenium) broke and now returns an empty list. As a single example, it is meant to visit: https://powersearch.jll.com/ca-en/property/52770/centurion-plaza-10335-172-street
and return the PDF link: https://powersearch.jll.com/res/docs/jll - centurion plaza - brochure - 02232022_11108659.pdf
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver_service = Service(executable_path="C:\\WPy64-39100\\chromedriver.exe")
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(service=driver_service, options=chrome_options)
site = 'https://powersearch.jll.com/ca-en/property/52770/centurion-plaza-10335-172-street'
driver.get(site)
time.sleep(10)  # fixed wait for the page to finish loading
elements = driver.find_elements(By.CLASS_NAME, 'pt-res-link')
links = [e.get_attribute("href") for e in elements]
print(links)  # currently prints []
I've tried various iterations of find_element (and find_elements), but selecting by class "pt-res-link" does not work reliably. Any advice appreciated, thanks.
Answer:
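I used requests and bs4 (BeautifulSoup) to get your desired output. The PDF link is already present in the page's HTML, inside the script tag with id "PowerSearch-state", so it can be read without driving a browser.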
Full Code
import requests
from bs4 import BeautifulSoup
url = "https://powersearch.jll.com/ca-en/property/52770/centurion-plaza-10335-172-street"
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
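# The page state lives in one <script> tag; split its text into chunks on ";"
script = soup.find("script", {"id": "PowerSearch-state"}).text.split(";")
for i in script:
    if ".pdf" in i and "https://" in i:
        # Cut off the escaped closing quote ("&q") that follows the URL
        print(i.split("&q")[0])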
Output
https://powersearch.jll.com/res/docs/jll - centurion plaza - brochure - 02232022_11108659.pdf
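If splitting on ";" ever proves fragile (for example, a URL that itself contains a semicolon), one alternative is to unescape the state text first and then pull the links out with a regular expression. This is only a sketch under assumptions: it assumes the "PowerSearch-state" script uses Angular-style escaping, where a double quote is stored as "&q;", which is what the "&q" split above suggests. The helper names (unescape_state, extract_pdf_links) and the sample string are made up for illustration.

```python
import re

# Assumption: Angular-style transfer-state escaping. "&a;" must be handled
# last so that an escaped ampersand is not mistaken for another token.
_ESCAPES = {"&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">", "&a;": "&"}

# A PDF URL runs from "https://" up to the first ".pdf" and may contain
# spaces, but never a double quote (the URL sits inside a quoted string).
PDF_URL = re.compile(r'https://[^"]+?\.pdf')


def unescape_state(text):
    """Turn the escaped tokens back into the characters they stand for."""
    for token, char in _ESCAPES.items():
        text = text.replace(token, char)
    return text


def extract_pdf_links(state_text):
    """Return every https .pdf URL found in the unescaped state text."""
    return PDF_URL.findall(unescape_state(state_text))


# Hypothetical sample standing in for the real script contents.
sample = "{&q;brochure&q;:&q;https://example.com/res/docs/a - b.pdf&q;,&q;id&q;:1}"
print(extract_pdf_links(sample))
```

With the code from the answer, this would be called as extract_pdf_links(soup.find("script", {"id": "PowerSearch-state"}).text), keeping the rest of the request logic unchanged.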