I'm trying to scrape data from this website: myworkdayjobs link
The data I want to collect are the job advertisements and their details. Currently there are 7 active jobs.
In the browser inspector I can see that the 7 elements I want are all li elements.
But page.html.xpath() always returns an empty list.
The steps I've taken are:
from requests_html import HTMLSession

session = HTMLSession()
url = (
'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
'?locations=91336993fab910af6d6f80c09504c167'
'&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
page.html.render(sleep=1, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
I've also tried several other XPaths, including the ones I get by right-clicking the element in the inspector and choosing to copy its XPath:
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]/div[1]
/html/body/div/div/div/div[3]/div/div/div[2]/section/ul/li[1]/div[1]
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]
The only time I get any li elements back is when I use:
cards = page.html.xpath('//li')
This returns the li elements at the bottom of the page, but it completely ignores the elements I want.
I'm not an expert on web scraping, but I have made it work with another careers page before. What am I missing? Why can't I access those elements?
=========================================================
Additional information: The problem seems to start below the section element.
When I run:
cards = page.html.xpath('//*[@id="mainContent"]/div/div[2]/section/*')
print(cards)
[<Element 'p' data-automation-id='jobFoundText' class=('css-12psxof',)>, <Element 'div' data-automation-id='jobJumpToDetailsContainer' class=('css-14l0ax5',)>, <Element 'div' class=('css-19kzrtu',)>]
Why is there no ul element in the list? It's clearly there in the inspector.
=========================================================
Answer (added here because the answer is in a comment on the accepted solution):
The page had apparently not fully loaded by the time cards was assigned, so the ul was not there yet.
Adding one more second to the renderer's sleep did the trick (sleep=2).
session = HTMLSession()
url = (
'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
'?locations=91336993fab910af6d6f80c09504c167'
'&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
page.html.render(sleep=2, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
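A fixed sleep=2 may stop being enough when the page loads more slowly. As a rough sketch (not part of the original fix), the render-and-query step can be retried with increasing sleep values until the XPath returns something. The helper name query_with_backoff, the sleep values, and the XPath in the commented wiring are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Sequence


def query_with_backoff(
    render_and_query: Callable[[float], Sequence],
    sleeps: Iterable[float] = (1, 2, 4),
) -> List:
    """Call render_and_query with each sleep value in turn.

    Returns the first non-empty result as a list, or an empty list
    if every attempt came back empty.
    """
    for sleep in sleeps:
        results = render_and_query(sleep)
        if results:
            return list(results)
    return []


# Hypothetical wiring with requests_html (needs network access and Chromium):
#
# from requests_html import HTMLSession
#
# def render_and_query(sleep):
#     page = HTMLSession().get(url)  # url as defined in the snippet above
#     page.html.render(sleep=sleep, scrolldown=1)
#     # XPath shortened from the question; adjust it to the current markup.
#     return page.html.xpath('//*[@id="mainContent"]//section/ul/li')
#
# cards = query_with_backoff(render_and_query)
```

Each attempt fetches and renders the page from scratch, so the worst case takes roughly the sum of all sleep values plus the rendering time.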
CodePudding user response:
You say you want the li elements, but your three XPath variants point to a div or to a single li. Try the specific XPath you need, for example '//li[@]'
CodePudding user response:
You can use their Ajax API to get the JSON data about the jobs. For example:
import requests
api_url = (
"https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
)
payload = {
"appliedFacets": {
"jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
"locations": ["91336993fab910af6d6f80c09504c167"],
},
"limit": 20,
"offset": 0,
"searchText": "",
}
data = requests.post(api_url, json=payload).json()
print(data)
Prints:
{
"total": 7,
"jobPostings": [
{
"title": "Senior CPU Compiler Engineer",
"externalPath": "/job/UK-Remote/Senior-CPU-Compiler-Engineer_JR1954638",
"locationsText": "7 Locations",
"postedOn": "Posted 18 Days Ago",
"bulletFields": ["JR1954638"],
},
{
"title": "CPU Compiler Engineer",
"externalPath": "/job/UK-Remote/CPU-Compiler-Engineer_JR1954640-1",
"locationsText": "7 Locations",
"postedOn": "Posted 26 Days Ago",
"bulletFields": ["JR1954640"],
},
...and so on.
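If a search ever matches more jobs than the limit, the same endpoint can presumably be paged with offset. The following is a minimal sketch under that assumption: fetch_all_jobs and post_json are hypothetical names, and the loop takes "total" from the first response only, in case later pages do not repeat it.

```python
from typing import Callable, Dict, List


def fetch_all_jobs(
    post_json: Callable[[Dict], Dict],
    facets: Dict,
    limit: int = 20,
) -> List[Dict]:
    """Collect every job posting by advancing offset until all pages are read."""
    jobs: List[Dict] = []
    offset = 0
    total = None
    while True:
        payload = {
            "appliedFacets": facets,
            "limit": limit,
            "offset": offset,
            "searchText": "",
        }
        data = post_json(payload)
        if total is None:
            # Trust only the first response for the overall count.
            total = data.get("total", 0)
        page = data.get("jobPostings", [])
        if not page:
            break
        jobs.extend(page)
        offset += len(page)
        if offset >= total:
            break
    return jobs


# Hypothetical wiring with the request shown above:
#
# import requests
# post_json = lambda payload: requests.post(api_url, json=payload).json()
# jobs = fetch_all_jobs(post_json, payload["appliedFacets"])
# for job in jobs:
#     print(job["title"], job["externalPath"])
```

Passing the HTTP call in as post_json keeps the paging logic independent of the network, which makes it easy to try out against canned responses first.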