How to scrape salary for each job post, its url and closing date?


I am trying to scrape the salary, closing date, and URL for each job post on a website. I have succeeded in getting the information from the main search page, but I am failing to get the information for each individual job post. url = https://www.higheredjobs.com/admin/search.cfm?JobCat=141&CatName=Academic Advising

expected output: df =

title university location study date url salary closing date
0 Educational Advisor Pasadena City College Pasadena, CA Academic Advising 09/04/22 https://www.higheredjobs.com/admin/details.cfm?JobCode=178087599&Title=Educational Advisor $58,620.12 to 64,628.76 9/25/2022 11:59 PM Pacific
1 Admin and Office Spec III: School of Humanities & Social Sciences Laurel Ridge Community College Middletown, VA Academic Advising 09/04/22 https://www.higheredjobs.com/admin/details.cfm?JobCode=178087578&Title=Admin and Office Spec III: School of Humanities & Social Sciences $32,000-$48,000 09/16/2022

code:

from pprint import pprint
import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup

cookies = {
    'CFID': '180615757',
    'CFTOKEN': '64089929988eb934-58E2ACC9-AD21-785B-2AFBCE86106B41FE',
    'visid_incap_2388351': '0Vmr7QpDRvmVw8fbXUJFkB5XEWMAAAAAQUIPAAAAAADtlXunU/D8GLU5VofHHier',
    '_ga_6ZQNJ4ELG2': 'GS1.1.1662315508.15.1.1662315668.0.0.0',
    '_ga': 'GA1.2.147261521.1662080801',
    '_gid': 'GA1.2.1149490171.1662080801',
    'reese84': '3:yMGXsdMquwoCj3IoSFRCMg==:Vf20HwL77P8oWYTTKbE0XigwyQE3d2lLQpPVoZYcoL8SJTmLeqAani 7GspfC2BiJYOOytBlkIp9MewLgs/XbkaiLrSvLnMdZ0aT8/M9FvBohByybnJXNl25ya/yfpGhL9oT1HKMZYnKqSR0Sg8 nHTUEO0/YErJgQmfoeYIT4kmE01S8cndGIemtuGjvq1hzB/D9VAQL7S3idutOumBNu84j5FyCdOBClCJTriE X9j40lj1swIxFlryTmBAtLHnEvN9M57N4LMb13yuSBaCawrv4fnron0JnUvfKpLU0CXTnpcM9hJNGv9Ekb4Ap43CZDPdeLVzEmj 39wCVtXPtMqBNCU6mPVBSeJCRHyRuQjY y0Sv5w7ME2LXhT8bEGHyE8yeuxddxvoG51STebu pb0mSp5n iKotUEn9h sA=:WH64twwKGqtE4pUorYOeGylONeXRsfG 3Qe3zAfpdrs=',
    '__atuvc': '65|35,2|36',
    'COOKIESTATUS': 'ON',
    'HIDECOOKIEBANNER': 'TRUE',
    'nlbi_2388351': 'jGGxMFazFBqnU x okRrFAAAAAC/AJ/k R2U vs5Q4LIRTS7',
    'nlbi_2388351_2147483392': 'PUildkEvtiZ9uje3okRrFAAAAABv1NR/7gPLX7Lc/iS5ei8N',
    'incap_ses_989_2388351': 'mWy Uq7aLX000xomDaO5DfTrFGMAAAAA6XmB42vG5CO6i609/RhyKg==',
    'incap_ses_468_2388351': 'sDNcR2labTHyNXYlUqx BipAFGMAAAAAImV2A07lGANZGfpvhvPlLg==',
    '__atuvs': '6314ec0cdbe92a78001',
    '_gat_gtag_UA_12825325_1': '1',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.higheredjobs.com/admin/',
    'Connection': 'keep-alive',
    # Requests sorts cookies= alphabetically
    # 'Cookie': 'CFID=180615757; CFTOKEN=64089929988eb934-58E2ACC9-AD21-785B-2AFBCE86106B41FE; visid_incap_2388351=0Vmr7QpDRvmVw8fbXUJFkB5XEWMAAAAAQUIPAAAAAADtlXunU/D8GLU5VofHHier; _ga_6ZQNJ4ELG2=GS1.1.1662315508.15.1.1662315668.0.0.0; _ga=GA1.2.147261521.1662080801; _gid=GA1.2.1149490171.1662080801; reese84=3:yMGXsdMquwoCj3IoSFRCMg==:Vf20HwL77P8oWYTTKbE0XigwyQE3d2lLQpPVoZYcoL8SJTmLeqAani 7GspfC2BiJYOOytBlkIp9MewLgs/XbkaiLrSvLnMdZ0aT8/M9FvBohByybnJXNl25ya/yfpGhL9oT1HKMZYnKqSR0Sg8 nHTUEO0/YErJgQmfoeYIT4kmE01S8cndGIemtuGjvq1hzB/D9VAQL7S3idutOumBNu84j5FyCdOBClCJTriE X9j40lj1swIxFlryTmBAtLHnEvN9M57N4LMb13yuSBaCawrv4fnron0JnUvfKpLU0CXTnpcM9hJNGv9Ekb4Ap43CZDPdeLVzEmj 39wCVtXPtMqBNCU6mPVBSeJCRHyRuQjY y0Sv5w7ME2LXhT8bEGHyE8yeuxddxvoG51STebu pb0mSp5n iKotUEn9h sA=:WH64twwKGqtE4pUorYOeGylONeXRsfG 3Qe3zAfpdrs=; __atuvc=65|35,2|36; COOKIESTATUS=ON; HIDECOOKIEBANNER=TRUE; nlbi_2388351=jGGxMFazFBqnU x okRrFAAAAAC/AJ/k R2U vs5Q4LIRTS7; nlbi_2388351_2147483392=PUildkEvtiZ9uje3okRrFAAAAABv1NR/7gPLX7Lc/iS5ei8N; incap_ses_989_2388351=mWy Uq7aLX000xomDaO5DfTrFGMAAAAA6XmB42vG5CO6i609/RhyKg==; incap_ses_468_2388351=sDNcR2labTHyNXYlUqx BipAFGMAAAAAImV2A07lGANZGfpvhvPlLg==; __atuvs=6314ec0cdbe92a78001; _gat_gtag_UA_12825325_1=1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    # Requests doesn't support trailers
    # 'TE': 'trailers',
}

params = {
    'JobCat': '141',
    'CatName': 'Academic Advising',
}

response = requests.get('https://www.higheredjobs.com/admin/search.cfm', params=params, cookies=cookies, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
jobs_list = []
for i in soup.select('.row.record'):
    jobs_list.append(dict(zip(['title', 'university', 'location', 'study', 'date'], i.stripped_strings)))
df = pd.DataFrame(jobs_list)

Present output: df =

title university location study date url salary closing date
0 Educational Advisor Pasadena City College Pasadena, CA Academic Advising 09/04/22
1 Admin and Office Spec III: School of Humanities & Social Sciences Laurel Ridge Community College Middletown, VA Academic Advising 09/04/22

Problem:

I could not scrape the sub-pages and the information for each post. How can I scrape the sub-pages and the required information?

CodePudding user response:

Information on this site is not that uniform, so there is no one-size-fits-all approach. To point you in a direction that may fit your needs: grab the URLs from the initial page and iterate over them to get the information from each detail page:

Note: The list of links in the example is sliced to [:10]; to get all results from this iteration, simply delete the slicing.

response = requests.get('https://www.higheredjobs.com/admin/search.cfm', params=params, cookies=cookies, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
jobs_list = []
for a in soup.select('.row.record a')[:10]:
    r = requests.get('https://www.higheredjobs.com/' + a.get('href'), params=params, cookies=cookies, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    # remove the <span class="at"> elements so their text does not pollute the extracted values
    for e in soup.select('span.at'):
        e.decompose()

    # build a label -> value dict from the job attributes box (the text that follows each <strong> label)
    d = dict((e.text.strip().rstrip(':'), e.next_sibling.strip()) for e in soup.select('#jobAttrib strong'))
    d.update({
        'Title':soup.h1.text,
        'Institute':soup.select_one('.job-inst').get_text(strip=True),
        'Location':soup.select_one('.job-loc').get_text(strip=True),
        'Category':soup.select_one('strong:-soup-contains("Category")').find_next().text.strip()
    })

    jobs_list.append(d)


df = pd.DataFrame(jobs_list)
df[['Title', 'Institute','Location', 'Salary','Type', 'Posted', 'Application Due', 'Category']]

Output

Title Institute Location Salary Type Posted Application Due Category
0 Academic Advisor Appalachian State University Boone, NC nan Full-Time 09/04/2022 Open Until Filled Academic Advising
1 Educational Advisor Pasadena City College Pasadena, CA 58,620.12 to 64,628.76 USD Per Year Full-Time 09/04/2022 nan Academic Advising
2 Admin and Office Spec III: School of Humanities & Social Sciences Laurel Ridge Community College Middletown, VA nan Full-Time 09/04/2022 09/16/2022 Academic Advising
3 Post-Licensure Instructional Coordinator and Academic Advisor Illinois State University Normal, IL nan Full-Time 09/04/2022 nan Academic Advising
4 College Advising Corps (CAC) Adviser Georgia State University Atlanta, GA nan Adjunct/Part-Time 09/04/2022 nan Academic Advising
5 Academic Advisor I ( S03944P) University of Texas at Arlington Arlington, TX nan Full-Time 09/03/2022 nan Academic Advising
6 Career & Academic Advisor (Part-Time) St. Petersburg College Pinellas Park, FL nan Adjunct/Part-Time 09/03/2022 nan Career Development and Services
7 Academic Advisor Simmons University Boston, MA nan Full-Time 09/03/2022 nan Academic Advising
8 Asst Director HSC Roanoke Radford University Radford, VA nan Full-Time 09/03/2022 nan Academic Advising
9 Academic Advisor, University Advising Center,Tahlequah Northeastern State University Tahlequah, OK nan Full-Time 09/03/2022 nan Academic Advising
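If you also want the columns from the expected output (url, salary, closing date), note that the code above never stores the detail-page URL itself and keeps the site's own labels (Salary, Application Due). Below is a minimal sketch, not the answerer's exact code, of one way to carry the URL along and rename the columns; it reuses the params, cookies and headers defined earlier, and the names base, listing and detail are introduced here only for illustration:

import requests
import pandas as pd
from bs4 import BeautifulSoup

base = 'https://www.higheredjobs.com/'  # same base the answer joins the relative hrefs to

# search results page, reusing params/cookies/headers from above
search = requests.get(base + 'admin/search.cfm', params=params, cookies=cookies, headers=headers)
listing = BeautifulSoup(search.text, 'html.parser')

jobs_list = []
for a in listing.select('.row.record a')[:10]:  # sliced for testing, as in the answer
    url = base + a.get('href')
    r = requests.get(url, cookies=cookies, headers=headers)
    detail = BeautifulSoup(r.text, 'html.parser')

    # same cleanup and label/value extraction as in the answer
    for e in detail.select('span.at'):
        e.decompose()
    d = dict((e.text.strip().rstrip(':'), e.next_sibling.strip()) for e in detail.select('#jobAttrib strong'))

    d['Title'] = detail.h1.text.strip()
    d['url'] = url  # keep the detail-page link for the final frame
    jobs_list.append(d)

df = pd.DataFrame(jobs_list).rename(columns={'Salary': 'salary', 'Application Due': 'closing date'})
print(df[['Title', 'salary', 'closing date', 'url']])

Keep in mind that if a key such as Salary or Application Due is missing from every record in the slice, selecting that column will raise a KeyError; df.reindex(columns=[...]) is a more forgiving alternative.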