I'm trying to scrape a page whose elements are created by JavaScript. When I run my script it does not return the full HTML. Is there a way to render the page first and then obtain the HTML?
import time

import bs4
from selenium import webdriver

url = "https://cibc.wd3.myworkdayjobs.com/search"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(2)
resLog = browser.page_source
soup = bs4.BeautifulSoup(resLog, "html.parser")
print(soup)
CodePudding user response:
The data is loaded dynamically via an XHR POST request that returns JSON. Here is an example of how to grab all job-posting data; you can extract whichever fields you want. Here I scrape only the titles.
Program:
import requests
import pandas as pd

api_url = "https://cibc.wd3.myworkdayjobs.com/wday/cxs/cibc/search/jobs"
headers = {'content-type': 'application/json'}

data = []
# The API pages 20 results at a time; advance the offset on each request
for offset in range(0, 1740, 20):
    body = {"appliedFacets": {}, "limit": 20, "offset": offset, "searchText": ""}
    jsonData = requests.post(api_url, json=body, headers=headers).json()
    for item in jsonData['jobPostings']:
        data.append(item['title'])

df = pd.DataFrame(data, columns=["Title"])
print(df)
Output:
Title
0 Senior IT Manager PMO
1 Financial Services Representative II
2 Mobile Mortgage Advisor Assistant -P/T OTTAWA,...
3 Financial Services Representative
4 Dev Ops Consultant
... ...
1735 Vice-President, Private Wealth and Asset Manag...
1736 Technical Lead
1737 Sr. Analyst, Procurement
1738 Sr. Financial Advisor - IIROC (Urban)
1739 Sr. Consultant, Vulnerability Management and C...
[1740 rows x 1 columns]
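Since each posting in the JSON response carries more than just a title, you can build richer rows from the same payload. Below is a minimal offline sketch of that idea, run against a hand-made sample dict shaped like the response above; the field name `externalPath` and the URL prefix are assumptions for illustration, not verified against the live API.

```python
import pandas as pd

# Sample payload mimicking the jobs API response structure
# (the 'externalPath' field name is an assumption for illustration).
sample = {
    "total": 2,
    "jobPostings": [
        {"title": "Technical Lead", "externalPath": "/job/Toronto/Technical-Lead_123"},
        {"title": "Dev Ops Consultant", "externalPath": "/job/Toronto/Dev-Ops_456"},
    ],
}

base = "https://cibc.wd3.myworkdayjobs.com"
rows = [
    {"title": p["title"], "url": base + p["externalPath"]}
    for p in sample["jobPostings"]
]
df = pd.DataFrame(rows)
print(df)
```

The same loop body would slot into the paginated requests above, appending a dict per posting instead of a bare title.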
CodePudding user response:
What happens?
There are several issues when you call this site:
Depending on your location it will redirect you
Loading / rendering the content takes a moment
How to fix?
The magic is to wait - so use WebDriverWait
to wait until the expected elements are present:
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#mainContent li')))
or in combination with change of the url:
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.url_changes('https://current_page.com'))
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#mainContent li')))
Example
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://cibc.wd3.myworkdayjobs.com/search'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 10)
wait.until(EC.url_changes('https://current_page.com'))
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#mainContent li')))

soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []
for a in soup.select('#mainContent li a'):
    data.append({
        'title': a.text,
        'url': 'https://cibc.wd3.myworkdayjobs.com' + a['href']
    })

print(pd.DataFrame(data))
Output
...