Can't narrow down the search criteria in a web scraper to search "job titles" and cou

Time:10-19

For some work I do, I need to gather data on job titles and how frequently they appear in search results, so I decided to use Python for this. The only problem is that I can't figure out why this code fragment I found isn't giving me the information I need. Here's what I have so far:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# We get the url
r = requests.get("https://www.usajobs.gov/Search/Results?j=0602&d=VA&p=1")
soup = BeautifulSoup(r.content, "html.parser")


# We get the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))


total = c_div
print(total)

I know that part of this involves inspecting the page source, but I can't figure out what I need to change to get the scraper to narrow down to these titles:

<a id="usajobs-search-result-0" href="/GetJob/ViewDetails/568337700" itemprop="title" data-document-id="568337700">


I would appreciate any help.

CodePudding user response:

The job listings are not in the initial HTML. The page loads them dynamically by sending a POST request to:

https://www.usajobs.gov/Search/ExecuteSearch

The example below returns the correct job titles. (Change the "Page" key to request a different page number.)

import requests


data = {
    "JobTitle": [],
    "GradeBucket": [],
    "JobCategoryCode": ["0602"],
    "JobCategoryFamily": [],
    "LocationName": [],
    "PostingChannel": [],
    "Department": ["VA"],
    "Agency": [],
    "PositionOfferingTypeCode": [],
    "TravelPercentage": [],
    "PositionScheduleTypeCode": [],
    "SecurityClearanceRequired": [],
    "PositionSensitivity": [],
    "ShowAllFilters": [],
    "HiringPath": [],
    "SocTitle": [],
    "MCOTags": [],
    "CyberWorkRole": [],
    "CyberWorkGrouping": [],
    "Page": "1",  # <-- Change page number here
    "UniqueSearchID": "9d417c5e-adc2-469c-af1d-e786cc41bc97",
    "IsAuthenticated": "false",
}


response = requests.post(
    "https://www.usajobs.gov/Search/ExecuteSearch", json=data
).json()

job_titles = [job["Title"] for job in response["Jobs"]]
print(job_titles)

Output:

['Psychiatrist - OCA', 'Physician - Electromyography (Temporary)', 'Physician Owensboro CBOC PC', 'Physician-Primary Care', 'OPHTHALMOLOGIST', 'UROLOGIST', 'PHYSICIAN (OTOLARYNGOLOGIST', 'Physician-Hospitalist', 'Physician - Hemotology/Oncology', 'Academic Gastroenterologist', 'Physician - Gastroenterologist', 'Physician - Orthopedic Surgeon', 'Physician (Internal Medicine or Family Practice)', 'Physician (Regular Ft)- Hematologist/Oncologist', 'Physician- Hematologist/Oncologist', 'Physician - Diagnostic Radiologist', 'Physician (Psychiatrist)', 'Physician (Endocrinologist)', 'Physician (Cardiologist)', 'Physician (Neurologist)', 'Physician (Chief Hospitalist)', 'Physician (Hospitalist)', 'Physician (Medical Director of Extended Care/Chief of Geriatrics)', 'Physician (Primary Care)', 'Physician (Hematologist/Oncologist)']
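
Since the question also asks how frequent each title is, here is a minimal sketch of counting the titles once you have them. The helper name count_titles and the sample list are assumptions for illustration only: the sample reuses a few titles from the output above plus one lowercase duplicate, so that the normalization is visible. In practice you would pass in the job_titles list from the request.

```python
from collections import Counter


def count_titles(titles):
    """Count how often each title occurs, ignoring case and surrounding whitespace."""
    normalized = (title.strip().lower() for title in titles)
    return Counter(normalized)


# Hypothetical sample standing in for the job_titles list from the response.
sample_titles = [
    "Physician (Hospitalist)",
    "Physician-Hospitalist",
    "UROLOGIST",
    "Physician (Primary Care)",
    "physician (primary care)",  # added duplicate, differs only in case
]

title_counts = count_titles(sample_titles)

# Print the most frequent titles first.
for title, count in title_counts.most_common():
    print(count, title)
```

This only merges titles that differ in case or surrounding whitespace, so variants such as "Physician (Hospitalist)" and "Physician-Hospitalist" are still counted separately. To cover more than one page, you could repeat the request with different "Page" values, extend a single list with each page's titles, and count that list at the end.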