I am trying to get the salary, closing date, and URL for each job post on a website. I can successfully scrape the information on the main page, but I am failing to get the information from each individual job post.
url = https://www.higheredjobs.com/admin/search.cfm?JobCat=141&CatName=Academic Advising
expected output: df =
|   | title | university | location | study | date | url | salary | closing date |
|---|---|---|---|---|---|---|---|---|
| 0 | Educational Advisor | Pasadena City College | Pasadena, CA | Academic Advising | 09/04/22 | https://www.higheredjobs.com/admin/details.cfm?JobCode=178087599&Title=Educational Advisor | $58,620.12 to 64,628.76 | 9/25/2022 11:59 PM Pacific |
| 1 | Admin and Office Spec III: School of Humanities & Social Sciences | Laurel Ridge Community College | Middletown, VA | Academic Advising | 09/04/22 | https://www.higheredjobs.com/admin/details.cfm?JobCode=178087578&Title=Admin and Office Spec III: School of Humanities & Social Sciences | $32,000-$48,000 | 09/16/2022 |
code:
from pprint import pprint
import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
cookies = {
'CFID': '180615757',
'CFTOKEN': '64089929988eb934-58E2ACC9-AD21-785B-2AFBCE86106B41FE',
'visid_incap_2388351': '0Vmr7QpDRvmVw8fbXUJFkB5XEWMAAAAAQUIPAAAAAADtlXunU/D8GLU5VofHHier',
'_ga_6ZQNJ4ELG2': 'GS1.1.1662315508.15.1.1662315668.0.0.0',
'_ga': 'GA1.2.147261521.1662080801',
'_gid': 'GA1.2.1149490171.1662080801',
'reese84': '3:yMGXsdMquwoCj3IoSFRCMg==:Vf20HwL77P8oWYTTKbE0XigwyQE3d2lLQpPVoZYcoL8SJTmLeqAani 7GspfC2BiJYOOytBlkIp9MewLgs/XbkaiLrSvLnMdZ0aT8/M9FvBohByybnJXNl25ya/yfpGhL9oT1HKMZYnKqSR0Sg8 nHTUEO0/YErJgQmfoeYIT4kmE01S8cndGIemtuGjvq1hzB/D9VAQL7S3idutOumBNu84j5FyCdOBClCJTriE X9j40lj1swIxFlryTmBAtLHnEvN9M57N4LMb13yuSBaCawrv4fnron0JnUvfKpLU0CXTnpcM9hJNGv9Ekb4Ap43CZDPdeLVzEmj 39wCVtXPtMqBNCU6mPVBSeJCRHyRuQjY y0Sv5w7ME2LXhT8bEGHyE8yeuxddxvoG51STebu pb0mSp5n iKotUEn9h sA=:WH64twwKGqtE4pUorYOeGylONeXRsfG 3Qe3zAfpdrs=',
'__atuvc': '65|35,2|36',
'COOKIESTATUS': 'ON',
'HIDECOOKIEBANNER': 'TRUE',
'nlbi_2388351': 'jGGxMFazFBqnU x okRrFAAAAAC/AJ/k R2U vs5Q4LIRTS7',
'nlbi_2388351_2147483392': 'PUildkEvtiZ9uje3okRrFAAAAABv1NR/7gPLX7Lc/iS5ei8N',
'incap_ses_989_2388351': 'mWy Uq7aLX000xomDaO5DfTrFGMAAAAA6XmB42vG5CO6i609/RhyKg==',
'incap_ses_468_2388351': 'sDNcR2labTHyNXYlUqx BipAFGMAAAAAImV2A07lGANZGfpvhvPlLg==',
'__atuvs': '6314ec0cdbe92a78001',
'_gat_gtag_UA_12825325_1': '1',
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
# 'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.higheredjobs.com/admin/',
'Connection': 'keep-alive',
# Requests sorts cookies= alphabetically
# 'Cookie': 'CFID=180615757; CFTOKEN=64089929988eb934-58E2ACC9-AD21-785B-2AFBCE86106B41FE; visid_incap_2388351=0Vmr7QpDRvmVw8fbXUJFkB5XEWMAAAAAQUIPAAAAAADtlXunU/D8GLU5VofHHier; _ga_6ZQNJ4ELG2=GS1.1.1662315508.15.1.1662315668.0.0.0; _ga=GA1.2.147261521.1662080801; _gid=GA1.2.1149490171.1662080801; reese84=3:yMGXsdMquwoCj3IoSFRCMg==:Vf20HwL77P8oWYTTKbE0XigwyQE3d2lLQpPVoZYcoL8SJTmLeqAani 7GspfC2BiJYOOytBlkIp9MewLgs/XbkaiLrSvLnMdZ0aT8/M9FvBohByybnJXNl25ya/yfpGhL9oT1HKMZYnKqSR0Sg8 nHTUEO0/YErJgQmfoeYIT4kmE01S8cndGIemtuGjvq1hzB/D9VAQL7S3idutOumBNu84j5FyCdOBClCJTriE X9j40lj1swIxFlryTmBAtLHnEvN9M57N4LMb13yuSBaCawrv4fnron0JnUvfKpLU0CXTnpcM9hJNGv9Ekb4Ap43CZDPdeLVzEmj 39wCVtXPtMqBNCU6mPVBSeJCRHyRuQjY y0Sv5w7ME2LXhT8bEGHyE8yeuxddxvoG51STebu pb0mSp5n iKotUEn9h sA=:WH64twwKGqtE4pUorYOeGylONeXRsfG 3Qe3zAfpdrs=; __atuvc=65|35,2|36; COOKIESTATUS=ON; HIDECOOKIEBANNER=TRUE; nlbi_2388351=jGGxMFazFBqnU x okRrFAAAAAC/AJ/k R2U vs5Q4LIRTS7; nlbi_2388351_2147483392=PUildkEvtiZ9uje3okRrFAAAAABv1NR/7gPLX7Lc/iS5ei8N; incap_ses_989_2388351=mWy Uq7aLX000xomDaO5DfTrFGMAAAAA6XmB42vG5CO6i609/RhyKg==; incap_ses_468_2388351=sDNcR2labTHyNXYlUqx BipAFGMAAAAAImV2A07lGANZGfpvhvPlLg==; __atuvs=6314ec0cdbe92a78001; _gat_gtag_UA_12825325_1=1',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
# Requests doesn't support trailers
# 'TE': 'trailers',
}
params = {
'JobCat': '141',
'CatName': 'Academic Advising',
}
response = requests.get('https://www.higheredjobs.com/admin/search.cfm', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
jobs_list = []
for i in soup.select('.row.record'):
    # map each listing row's text fragments onto the listing columns, in order
    jobs_list.append(dict(zip(['title', 'university', 'location', 'study', 'date'], i.stripped_strings)))
df = pd.DataFrame(jobs_list)
Present output: df =
|   | title | university | location | study | date | url | salary | closing date |
|---|---|---|---|---|---|---|---|---|
| 0 | Educational Advisor | Pasadena City College | Pasadena, CA | Academic Advising | 09/04/22 | | | |
| 1 | Admin and Office Spec III: School of Humanities & Social Sciences | Laurel Ridge Community College | Middletown, VA | Academic Advising | 09/04/22 | | | |
Problem:
I could not scrape the sub-pages to get the information for each post. How can I scrape each sub-page and extract the required information?
CodePudding user response:
Information on this site is not that uniform, so there is no one-size-fits-all approach. To point you in a direction that may fit your needs: grab the URLs from the initial page and iterate over them to get the information from each detail page:
Note: The list of links in the example is sliced to [:10]; to get all results from this iteration, simply remove the slicing.
response = requests.get('https://www.higheredjobs.com/admin/search.cfm', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
jobs_list = []

for a in soup.select('.row.record a')[:10]:
    # follow each listing link to its detail page
    r = requests.get('https://www.higheredjobs.com/' + a.get('href'), params=params, cookies=cookies, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # drop the <span class="at"> elements so their text does not leak into the extracted values
    for e in soup.select('span.at'):
        e.decompose()
    # build label -> value pairs from the job attributes block (Salary, Type, Posted, Application Due, ...)
    d = dict((e.text.strip().rstrip(':'), e.next_sibling.strip()) for e in soup.select('#jobAttrib strong'))
    d.update({
        'Title': soup.h1.text,
        'Institute': soup.select_one('.job-inst').get_text(strip=True),
        'Location': soup.select_one('.job-loc').get_text(strip=True),
        'Category': soup.select_one('strong:-soup-contains("Category")').find_next().text.strip()
    })
    jobs_list.append(d)

df = pd.DataFrame(jobs_list)
df[['Title', 'Institute', 'Location', 'Salary', 'Type', 'Posted', 'Application Due', 'Category']]
Output
|   | Title | Institute | Location | Salary | Type | Posted | Application Due | Category |
|---|---|---|---|---|---|---|---|---|
| 0 | Academic Advisor | Appalachian State University | Boone, NC | nan | Full-Time | 09/04/2022 | Open Until Filled | Academic Advising |
| 1 | Educational Advisor | Pasadena City College | Pasadena, CA | 58,620.12 to 64,628.76 USD Per Year | Full-Time | 09/04/2022 | nan | Academic Advising |
| 2 | Admin and Office Spec III: School of Humanities & Social Sciences | Laurel Ridge Community College | Middletown, VA | nan | Full-Time | 09/04/2022 | 09/16/2022 | Academic Advising |
| 3 | Post-Licensure Instructional Coordinator and Academic Advisor | Illinois State University | Normal, IL | nan | Full-Time | 09/04/2022 | nan | Academic Advising |
| 4 | College Advising Corps (CAC) Adviser | Georgia State University | Atlanta, GA | nan | Adjunct/Part-Time | 09/04/2022 | nan | Academic Advising |
| 5 | Academic Advisor I ( S03944P) | University of Texas at Arlington | Arlington, TX | nan | Full-Time | 09/03/2022 | nan | Academic Advising |
| 6 | Career & Academic Advisor (Part-Time) | St. Petersburg College | Pinellas Park, FL | nan | Adjunct/Part-Time | 09/03/2022 | nan | Career Development and Services |
| 7 | Academic Advisor | Simmons University | Boston, MA | nan | Full-Time | 09/03/2022 | nan | Academic Advising |
| 8 | Asst Director HSC Roanoke | Radford University | Radford, VA | nan | Full-Time | 09/03/2022 | nan | Academic Advising |
| 9 | Academic Advisor, University Advising Center,Tahlequah | Northeastern State University | Tahlequah, OK | nan | Full-Time | 09/03/2022 | nan | Academic Advising |
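If you also need the url column from your expected output, one option is to keep the address of each detail page while iterating. Below is a minimal sketch that reuses the params, cookies, and headers dicts from the question and the same selectors as above; the base and detail names and the one-second pause are just illustrative, and the Salary / Application Due keys are assumed to come out of the #jobAttrib block as shown in the output table.
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

base = 'https://www.higheredjobs.com/'
response = requests.get(base + 'admin/search.cfm', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

jobs_list = []
for a in soup.select('.row.record a')[:10]:   # drop the slice to crawl every listing
    r = requests.get(base + a.get('href'), cookies=cookies, headers=headers)
    detail = BeautifulSoup(r.text, 'html.parser')
    for e in detail.select('span.at'):        # same cleanup as in the code above
        e.decompose()
    d = dict((e.text.strip().rstrip(':'), e.next_sibling.strip())
             for e in detail.select('#jobAttrib strong'))
    d['Title'] = detail.h1.text
    d['URL'] = r.url                          # address of the detail page -> your "url" column
    jobs_list.append(d)
    time.sleep(1)                             # small pause between detail requests

df = pd.DataFrame(jobs_list)
print(df[['Title', 'URL', 'Salary', 'Application Due']])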