I'm trying to scrape a page whose elements are created by JavaScript. When I run my script it does not return the full HTML. Is there a way to render the page first and then obtain the HTML?
import time

import bs4
from selenium import webdriver

url = "https://cibc.wd3.myworkdayjobs.com/search"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(2)
resLog = browser.page_source
soup = bs4.BeautifulSoup(resLog, "html.parser")
print(soup)
CodePudding user response:
The data is loaded dynamically via an XHR POST request that returns JSON. Here is an example of how to grab all job-posting data; you can extract whichever fields you want. Here I scrape only the titles.
Program:
import requests
import pandas as pd

api_url = "https://cibc.wd3.myworkdayjobs.com/wday/cxs/cibc/search/jobs"
headers = {'content-type': 'application/json'}

data = []
# The API pages 20 results at a time; advance the offset on each request
for offset in range(0, 1740, 20):
    body = {"appliedFacets": {}, "limit": 20, "offset": offset, "searchText": ""}
    jsonData = requests.post(api_url, json=body, headers=headers).json()
    for item in jsonData['jobPostings']:
        data.append(item['title'])

df = pd.DataFrame(data, columns=["Title"])
print(df)
Output:
Title
0 Senior IT Manager PMO
1 Financial Services Representative II
2 Mobile Mortgage Advisor Assistant -P/T OTTAWA,...
3 Financial Services Representative
4 Dev Ops Consultant
... ...
1735 Vice-President, Private Wealth and Asset Manag...
1736 Technical Lead
1737 Sr. Analyst, Procurement
1738 Sr. Financial Advisor - IIROC (Urban)
1739 Sr. Consultant, Vulnerability Management and C...
[1740 rows x 1 columns]
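Since each posting in the JSON response carries more than just a title, you can build richer rows from the same payload. Below is a minimal offline sketch of that idea, run against a hand-made sample dict shaped like the response above; the field name `externalPath` and the URL prefix are assumptions for illustration, not verified against the live API.

```python
import pandas as pd

# Sample payload mimicking the jobs API response structure
# (the 'externalPath' field name is an assumption for illustration).
sample = {
    "total": 2,
    "jobPostings": [
        {"title": "Technical Lead", "externalPath": "/job/Toronto/Technical-Lead_123"},
        {"title": "Dev Ops Consultant", "externalPath": "/job/Toronto/Dev-Ops_456"},
    ],
}

base = "https://cibc.wd3.myworkdayjobs.com"
rows = [
    {"title": p["title"], "url": base + p["externalPath"]}
    for p in sample["jobPostings"]
]
df = pd.DataFrame(rows)
print(df)
```

The same loop body would slot into the paginated requests above, appending a dict per posting instead of a bare title.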
CodePudding user response:
What happens?
There are several issues when you call this site:
Depending on your location it will redirect you
Loading / rendering the content takes a moment
How to fix?
The magic is to wait - so use WebDriverWait
to wait until the expected elements are present:
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#mainContent li')))
or in combination with change of the url:
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.url_changes('https://current_page.com'))
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#mainContent li')))
Example
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://cibc.wd3.myworkdayjobs.com/search'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 10)
wait.until(EC.url_changes('https://current_page.com'))
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#mainContent li')))

soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []
for a in soup.select('#mainContent li a'):
    data.append({
        'title': a.text,
        'url': 'https://cibc.wd3.myworkdayjobs.com' + a['href']
    })

print(pd.DataFrame(data))
Output
...