How to scrape salary for each job post, its url and closing date?


I am trying to scrape the salary, closing date, and URL for each job post on a website. I have succeeded in getting the information from the main search page, but I am failing to get the information for each individual job post. url = https://www.higheredjobs.com/admin/search.cfm?JobCat=141&CatName=Academic Advising

expected output: df =

title university location study date url salary closing date
0 Educational Advisor Pasadena City College Pasadena, CA Academic Advising 09/04/22 https://www.higheredjobs.com/admin/details.cfm?JobCode=178087599&Title=Educational Advisor $58,620.12 to 64,628.76 9/25/2022 11:59 PM Pacific
1 Admin and Office Spec III: School of Humanities & Social Sciences Laurel Ridge Community College Middletown, VA Academic Advising 09/04/22 https://www.higheredjobs.com/admin/details.cfm?JobCode=178087578&Title=Admin and Office Spec III: School of Humanities & Social Sciences $32,000-$48,000 09/16/2022

code:

from pprint import pprint
import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup

cookies = {
    'CFID': '180615757',
    'CFTOKEN': '64089929988eb934-58E2ACC9-AD21-785B-2AFBCE86106B41FE',
    'visid_incap_2388351': '0Vmr7QpDRvmVw8fbXUJFkB5XEWMAAAAAQUIPAAAAAADtlXunU/D8GLU5VofHHier',
    '_ga_6ZQNJ4ELG2': 'GS1.1.1662315508.15.1.1662315668.0.0.0',
    '_ga': 'GA1.2.147261521.1662080801',
    '_gid': 'GA1.2.1149490171.1662080801',
    'reese84': '3:yMGXsdMquwoCj3IoSFRCMg==:Vf20HwL77P8oWYTTKbE0XigwyQE3d2lLQpPVoZYcoL8SJTmLeqAani 7GspfC2BiJYOOytBlkIp9MewLgs/XbkaiLrSvLnMdZ0aT8/M9FvBohByybnJXNl25ya/yfpGhL9oT1HKMZYnKqSR0Sg8 nHTUEO0/YErJgQmfoeYIT4kmE01S8cndGIemtuGjvq1hzB/D9VAQL7S3idutOumBNu84j5FyCdOBClCJTriE X9j40lj1swIxFlryTmBAtLHnEvN9M57N4LMb13yuSBaCawrv4fnron0JnUvfKpLU0CXTnpcM9hJNGv9Ekb4Ap43CZDPdeLVzEmj 39wCVtXPtMqBNCU6mPVBSeJCRHyRuQjY y0Sv5w7ME2LXhT8bEGHyE8yeuxddxvoG51STebu pb0mSp5n iKotUEn9h sA=:WH64twwKGqtE4pUorYOeGylONeXRsfG 3Qe3zAfpdrs=',
    '__atuvc': '65|35,2|36',
    'COOKIESTATUS': 'ON',
    'HIDECOOKIEBANNER': 'TRUE',
    'nlbi_2388351': 'jGGxMFazFBqnU x okRrFAAAAAC/AJ/k R2U vs5Q4LIRTS7',
    'nlbi_2388351_2147483392': 'PUildkEvtiZ9uje3okRrFAAAAABv1NR/7gPLX7Lc/iS5ei8N',
    'incap_ses_989_2388351': 'mWy Uq7aLX000xomDaO5DfTrFGMAAAAA6XmB42vG5CO6i609/RhyKg==',
    'incap_ses_468_2388351': 'sDNcR2labTHyNXYlUqx BipAFGMAAAAAImV2A07lGANZGfpvhvPlLg==',
    '__atuvs': '6314ec0cdbe92a78001',
    '_gat_gtag_UA_12825325_1': '1',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.higheredjobs.com/admin/',
    'Connection': 'keep-alive',
    # Requests sorts cookies= alphabetically
    # 'Cookie': 'CFID=180615757; CFTOKEN=64089929988eb934-58E2ACC9-AD21-785B-2AFBCE86106B41FE; visid_incap_2388351=0Vmr7QpDRvmVw8fbXUJFkB5XEWMAAAAAQUIPAAAAAADtlXunU/D8GLU5VofHHier; _ga_6ZQNJ4ELG2=GS1.1.1662315508.15.1.1662315668.0.0.0; _ga=GA1.2.147261521.1662080801; _gid=GA1.2.1149490171.1662080801; reese84=3:yMGXsdMquwoCj3IoSFRCMg==:Vf20HwL77P8oWYTTKbE0XigwyQE3d2lLQpPVoZYcoL8SJTmLeqAani 7GspfC2BiJYOOytBlkIp9MewLgs/XbkaiLrSvLnMdZ0aT8/M9FvBohByybnJXNl25ya/yfpGhL9oT1HKMZYnKqSR0Sg8 nHTUEO0/YErJgQmfoeYIT4kmE01S8cndGIemtuGjvq1hzB/D9VAQL7S3idutOumBNu84j5FyCdOBClCJTriE X9j40lj1swIxFlryTmBAtLHnEvN9M57N4LMb13yuSBaCawrv4fnron0JnUvfKpLU0CXTnpcM9hJNGv9Ekb4Ap43CZDPdeLVzEmj 39wCVtXPtMqBNCU6mPVBSeJCRHyRuQjY y0Sv5w7ME2LXhT8bEGHyE8yeuxddxvoG51STebu pb0mSp5n iKotUEn9h sA=:WH64twwKGqtE4pUorYOeGylONeXRsfG 3Qe3zAfpdrs=; __atuvc=65|35,2|36; COOKIESTATUS=ON; HIDECOOKIEBANNER=TRUE; nlbi_2388351=jGGxMFazFBqnU x okRrFAAAAAC/AJ/k R2U vs5Q4LIRTS7; nlbi_2388351_2147483392=PUildkEvtiZ9uje3okRrFAAAAABv1NR/7gPLX7Lc/iS5ei8N; incap_ses_989_2388351=mWy Uq7aLX000xomDaO5DfTrFGMAAAAA6XmB42vG5CO6i609/RhyKg==; incap_ses_468_2388351=sDNcR2labTHyNXYlUqx BipAFGMAAAAAImV2A07lGANZGfpvhvPlLg==; __atuvs=6314ec0cdbe92a78001; _gat_gtag_UA_12825325_1=1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    # Requests doesn't support trailers
    # 'TE': 'trailers',
}

params = {
    'JobCat': '141',
    'CatName': 'Academic Advising',
}

response = requests.get('https://www.higheredjobs.com/admin/search.cfm', params=params, cookies=cookies, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
jobs_list = []
for i in soup.select('.row.record'):
    jobs_list.append(dict(zip(['title', 'university', 'location', 'study', 'date'], i.stripped_strings)))
df = pd.DataFrame(jobs_list)

Present output: df =

title university location study date url salary closing date
0 Educational Advisor Pasadena City College Pasadena, CA Academic Advising 09/04/22
1 Admin and Office Spec III: School of Humanities & Social Sciences Laurel Ridge Community College Middletown, VA Academic Advising 09/04/22

Problem:

I could not scrape the sub-pages and the information for each post. How can I scrape the sub-pages and the required information?

CodePudding user response:

Information on this site is not that uniform, so there is no one-size-fits-all approach. To point you in a direction that may fit your needs: grab the URLs from the initial page and iterate over them to get the information from each detail page:

Note: The list of links in the example is sliced to [:10]; to get all results from this iteration, simply delete the slicing.

response = requests.get('https://www.higheredjobs.com/admin/search.cfm', params=params, cookies=cookies, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
jobs_list = []
for a in soup.select('.row.record a')[:10]:
    r = requests.get('https://www.higheredjobs.com/' + a.get('href'), params=params, cookies=cookies, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    # remove the <span class="at"> elements so their text does not pollute the extracted values
    for e in soup.select('span.at'):
        e.decompose()

    # build a label -> value dict from the job attributes box (the text that follows each <strong> label)
    d = dict((e.text.strip().rstrip(':'), e.next_sibling.strip()) for e in soup.select('#jobAttrib strong'))
    d.update({
        'Title':soup.h1.text,
        'Institute':soup.select_one('.job-inst').get_text(strip=True),
        'Location':soup.select_one('.job-loc').get_text(strip=True),
        'Category':soup.select_one('strong:-soup-contains("Category")').find_next().text.strip()
    })

    jobs_list.append(d)


df = pd.DataFrame(jobs_list)
df[['Title', 'Institute','Location', 'Salary','Type', 'Posted', 'Application Due', 'Category']]

Output

Title Institute Location Salary Type Posted Application Due Category
0 Academic Advisor Appalachian State University Boone, NC nan Full-Time 09/04/2022 Open Until Filled Academic Advising
1 Educational Advisor Pasadena City College Pasadena, CA 58,620.12 to 64,628.76 USD Per Year Full-Time 09/04/2022 nan Academic Advising
2 Admin and Office Spec III: School of Humanities & Social Sciences Laurel Ridge Community College Middletown, VA nan Full-Time 09/04/2022 09/16/2022 Academic Advising
3 Post-Licensure Instructional Coordinator and Academic Advisor Illinois State University Normal, IL nan Full-Time 09/04/2022 nan Academic Advising
4 College Advising Corps (CAC) Adviser Georgia State University Atlanta, GA nan Adjunct/Part-Time 09/04/2022 nan Academic Advising
5 Academic Advisor I ( S03944P) University of Texas at Arlington Arlington, TX nan Full-Time 09/03/2022 nan Academic Advising
6 Career & Academic Advisor (Part-Time) St. Petersburg College Pinellas Park, FL nan Adjunct/Part-Time 09/03/2022 nan Career Development and Services
7 Academic Advisor Simmons University Boston, MA nan Full-Time 09/03/2022 nan Academic Advising
8 Asst Director HSC Roanoke Radford University Radford, VA nan Full-Time 09/03/2022 nan Academic Advising
9 Academic Advisor, University Advising Center,Tahlequah Northeastern State University Tahlequah, OK nan Full-Time 09/03/2022 nan Academic Advising
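If you also want the columns from the expected output (url, salary, closing date), note that the code above never stores the detail-page URL itself and keeps the site's own labels (Salary, Application Due). Below is a minimal sketch, not the answerer's exact code, of one way to carry the URL along and rename the columns; it reuses the params, cookies and headers defined earlier, and the names base, listing and detail are introduced here only for illustration:

import requests
import pandas as pd
from bs4 import BeautifulSoup

base = 'https://www.higheredjobs.com/'  # same base the answer joins the relative hrefs to

# search results page, reusing params/cookies/headers from above
search = requests.get(base + 'admin/search.cfm', params=params, cookies=cookies, headers=headers)
listing = BeautifulSoup(search.text, 'html.parser')

jobs_list = []
for a in listing.select('.row.record a')[:10]:  # sliced for testing, as in the answer
    url = base + a.get('href')
    r = requests.get(url, cookies=cookies, headers=headers)
    detail = BeautifulSoup(r.text, 'html.parser')

    # same cleanup and label/value extraction as in the answer
    for e in detail.select('span.at'):
        e.decompose()
    d = dict((e.text.strip().rstrip(':'), e.next_sibling.strip()) for e in detail.select('#jobAttrib strong'))

    d['Title'] = detail.h1.text.strip()
    d['url'] = url  # keep the detail-page link for the final frame
    jobs_list.append(d)

df = pd.DataFrame(jobs_list).rename(columns={'Salary': 'salary', 'Application Due': 'closing date'})
print(df[['Title', 'salary', 'closing date', 'url']])

Keep in mind that if a key such as Salary or Application Due is missing from every record in the slice, selecting that column will raise a KeyError; df.reindex(columns=[...]) is a more forgiving alternative.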