python selenium webscraping (clicking buttons which shows data and then extracting it)


What I'm trying to do: on https://www.jobbank.gc.ca/jobsearch/jobsearch?sort=D&fsrc=16&fbclid=IwAR2SIG3lbY1S9lO4WilcKw6TxJAJQbFIGYTVE_tOTqYRpb43qM3uYgLWV64, open every listing; each one redirects to another page with a "Show how to apply" button, and clicking that button reveals an email address. I want to scrape every job listing's title and email address with my code. I have already scraped the titles and hrefs, but I have no idea what to do next (i.e., opening every job listing, clicking "Show how to apply", and scraping the email from there). I hope you understand what I'm after (sorry for my English).

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
s = Service(r'C:\Program Files (x86)\chromedriver.exe')  # raw string so the backslashes aren't treated as escapes
driver = webdriver.Chrome(service=s)
driver.get('https://www.jobbank.gc.ca/jobsearch/jobsearch?sort=D&fsrc=16&fbclid=IwAR2SIG3lbY1S9lO4WilcKw6TxJAJQbFIGYTVE_tOTqYRpb43qM3uYgLWV64')

# Get titles of Job listings
elements = []
for element in driver.find_elements(By.CLASS_NAME, 'resultJobItem'):
    title = element.find_element(By.XPATH, './/*[@class="noctitle"]').text  # the title span of the listing
    if title not in elements:
        elements.append({'Title': title.split('\n')})

# Get all href
link = driver.find_elements(By.XPATH, './/*[@class="results-jobs"]/article/a')
for links in link:
    elements.append({'Link': links.get_attribute('href')})

print(elements)

CodePudding user response:

Looks like you can use their own API with a POST request to get the data.

You'll need to scrape the job id first.

For the job at this URL: https://www.jobbank.gc.ca/jobsearch/jobposting/35213663, the job id is 1860693 (note it's not the same as the id in the URL), so I'd need to post a request like this:

import requests
from bs4 import BeautifulSoup as BS

url = "https://www.jobbank.gc.ca/jobsearch/jobposting/35213663"
jobid = "1860693"

# Form payload that mimics clicking the "Show how to apply" button
data = {
  'seekeractivity:jobid': jobid,
  'seekeractivity_SUBMIT': '1',
  'javax.faces.ViewState': 'stateless',
  'javax.faces.behavior.event': 'action',
  'jbfeJobId': jobid,
  'action': 'applynowbutton',
  'javax.faces.partial.event': 'click',
  'javax.faces.source': 'seekeractivity',
  'javax.faces.partial.ajax': 'true',
  'javax.faces.partial.execute': 'jobid',
  'javax.faces.partial.render': 'applynow',
  'seekeractivity': 'seekeractivity'
}

response = requests.post(url, data=data)

soup = BS(response.text, 'html.parser')  # explicit parser avoids the bs4 warning
email = soup.a.text                      # the first <a> in the partial response is the mailto link
print(email)
This gives me:
>> [email protected]
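
To cover every listing rather than one hard-coded id, the job id has to be scraped from each posting page first. A minimal sketch, assuming the numeric id appears near the jbfeJobId name somewhere in the page source (an unverified assumption; inspect the real HTML and adjust the pattern), reusing the same payload in a loop:

import re
import requests
from bs4 import BeautifulSoup as BS

def scrape_email(posting_url):
    # ASSUMPTION: the numeric job id appears next to "jbfeJobId" in the page
    # source; verify against the live page and adjust the regex if needed.
    html = requests.get(posting_url).text
    m = re.search(r'jbfeJobId\D*(\d+)', html)
    if not m:
        return None
    jobid = m.group(1)
    data = {
        'seekeractivity:jobid': jobid,
        'seekeractivity_SUBMIT': '1',
        'javax.faces.ViewState': 'stateless',
        'javax.faces.behavior.event': 'action',
        'jbfeJobId': jobid,
        'action': 'applynowbutton',
        'javax.faces.partial.event': 'click',
        'javax.faces.source': 'seekeractivity',
        'javax.faces.partial.ajax': 'true',
        'javax.faces.partial.execute': 'jobid',
        'javax.faces.partial.render': 'applynow',
        'seekeractivity': 'seekeractivity'
    }
    response = requests.post(posting_url, data=data)
    soup = BS(response.text, 'html.parser')
    return soup.a.text if soup.a else None

# Usage: feed in the hrefs collected with Selenium in the question
for link in ['https://www.jobbank.gc.ca/jobsearch/jobposting/35213663']:
    print(link, '->', scrape_email(link))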

CodePudding user response:

I would store all the links separately.
So assume the variable all_links contains all the links. Now,

.
.
.
driver.quit()

link1 = all_links[0]  # take the first link as an example; you'd loop over all of them: for link in all_links

new_driver = webdriver.Chrome(service=s)
new_driver.get(link1)

new_driver.find_element(By.CSS_SELECTOR, "#applynowbutton").click()

At this point the 'Show how to Apply' button has been clicked.

Unfortunately, I don't know too much about HTML, but essentially at this point you can extract the email much like you extracted the links previously.
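
A minimal sketch of that last step, assuming the revealed address is rendered as a mailto: link (an assumption worth checking in the page source) and using an explicit wait so the content has time to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for an anchor whose href starts with "mailto:",
# then strip the scheme to get the plain address.
email_link = WebDriverWait(new_driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href^="mailto:"]'))
)
email = email_link.get_attribute('href').replace('mailto:', '')
print(email)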

CodePudding user response:

Try like below:

You can apply scrollIntoView to each job option; when the loop runs past the loaded results, click the Show more button and continue extracting details.

import time

driver.get("https://www.jobbank.gc.ca/jobsearch/jobsearch?sort=D&fsrc=16&fbclid=IwAR2SIG3lbY1S9lO4WilcKw6TxJAJQbFIGYTVE_tOTqYRpb43qM3uYgLWV64")

i = 0
while True:
    try:
        jobs = driver.find_elements(By.XPATH, "//div[@class='results-jobs']/article")
        driver.execute_script("arguments[0].scrollIntoView(true);", jobs[i])
        title = jobs[i].find_element(By.XPATH, ".//span[@class='noctitle']").text
        link = jobs[i].find_element(By.TAG_NAME, "a").get_attribute("href")
        print(f"{i + 1} - {title} : {link}")
        i += 1
        if i == 100:
            break
    except IndexError:
        # Ran past the loaded results: load the next batch and retry
        driver.find_element(By.ID, "moreresultbutton").click()
        time.sleep(3)
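
If the fixed time.sleep(3) proves flaky, a more robust option is an explicit wait on the Show more button. A sketch using Selenium's built-in waits, meant to replace the click + sleep in the except branch:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for the "Show more" button to become clickable, then click it.
more = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "moreresultbutton"))
)
more.click()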