How to scrape script tags with type application/ld+json and the data-react-helmet attribute using BeautifulSoup?

Time:04-28

I'm new to web scraping with Python. I've written code to pull data from a job portal site using Selenium and BeautifulSoup. My workflow is:

  1. Scrape all the job posting links on the job portal site.
  2. Loop over the collected links and scrape the detailed info from each job posting.

I scrape the detailed info with BeautifulSoup's find_all method, looking for script tags with type='application/ld+json' and the data-react-helmet attribute. But I get the error message "list index out of range". Does anyone know how to solve it?

Error message: list index out of range

Code:

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# The request headers are the same for every URL, so build them once.
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'referer': 'https://google.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

rows = []
# URL_job_list holds the job posting links collected in step 1.
for url in URL_job_list:
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
    if not script_tags:
        # Indexing an empty list with [-1] is what raises "list index out of range".
        print('No matching script tag in ' + url)
        continue
    metadata = script_tags[-1].text

    temp_dict = {}

    try:
        job_info_json = json.loads(metadata, strict=False)
    except json.JSONDecodeError:
        print("Empty response")
        continue

    # A missing key raises KeyError (TypeError if the parent is not a dict), not AttributeError.
    try:
        jobID = job_info_json['identifier']['value']
        temp_dict['Job ID'] = jobID
        print('Job ID = ' + str(jobID))
    except (KeyError, TypeError):
        jobID = ''

    try:
        jobTitle = job_info_json['title']
        temp_dict['Job Title'] = jobTitle
        print('Title = ' + str(jobTitle))
    except (KeyError, TypeError):
        jobTitle = ''

    try:
        occupationalCategory = job_info_json['occupationalCategory']
        temp_dict['occupationalCategory'] = occupationalCategory
        print('Occupational Category = ' + str(occupationalCategory))
    except (KeyError, TypeError):
        occupationalCategory = ''

    # Store the current link, not the whole list of links.
    temp_dict['Job Link'] = url
    rows.append(temp_dict)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows.
job_main_data = pd.DataFrame(rows)

CodePudding user response:

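The data is loaded dynamically by JavaScript from an API that returns JSON, so the script tag you are looking for is probably not in the HTML that requests receives. That would explain why find_all returns an empty list and indexing it raises "list index out of range". You can get all the data you want directly from that API. Below is an example of how to extract data from the API using only the requests module.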
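As a side check, separate from the API example that follows, here is a minimal sketch of how one might test whether a page's static HTML contains any JSON-LD script tag at all. The helper name extract_json_ld and both sample HTML strings are made up for illustration; they are not from the job portal. It uses the built-in html.parser backend so it runs without lxml.

```python
import json

from bs4 import BeautifulSoup


def extract_json_ld(html):
    """Return the parsed JSON-LD objects found in the HTML (possibly an empty list)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", attrs={"type": "application/ld+json"}):
        # tag.string is None for an empty tag, so fall back to an empty string.
        text = str(tag.string or "")
        try:
            results.append(json.loads(text, strict=False))
        except json.JSONDecodeError:
            # Skip tags whose content is empty or not valid JSON.
            continue
    return results


# Hypothetical page where the tag is already present in the static HTML.
static_page = (
    '<html><head><script data-react-helmet="true" type="application/ld+json">'
    '{"title": "Data Engineer"}</script></head></html>'
)
# Hypothetical page that is only an empty shell until JavaScript runs.
js_only_page = '<html><body><div id="root"></div></body></html>'

print(extract_json_ld(static_page))
print(extract_json_ld(js_only_page))
```

If the second case is what requests gets back from the real site, the list is empty and there is nothing to index, which is the situation the original loop did not guard against.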

import requests
import json

# The params strings contain double quotes, so they are wrapped in single quotes.
payload = {
    "requests": [
        {
            "indexName": "job_postings",
            "params": 'query=&hitsPerPage=20&maxValuesPerFacet=1000&page=0&facets=["*","city.work_country_name","position.name","industries.vertical_name","experience","job_type.name","is_salary_visible","has_equity","currency.currency_code","salary_min","taxonomies.slug"]&tagFilters=&facetFilters=[["city.work_country_name:Indonesia"]]'
        },
        {
            "indexName": "job_postings",
            "params": "query=&hitsPerPage=1&maxValuesPerFacet=1000&page=0&attributesToRetrieve=[]&attributesToHighlight=[]&attributesToSnippet=[]&tagFilters=&analytics=false&clickAnalytics=false&facets=city.work_country_name"
        }
    ]
}
headers={'content-type': 'application/x-www-form-urlencoded'}
api_url = "https://219wx3mpv4-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for vanilla JavaScript 3.30.0;JS Helper 2.26.1&x-algolia-application-id=219WX3MPV4&x-algolia-api-key=b528008a75dc1c4402bfe0d8db8b3f8e"

jsonData=requests.post(api_url,data=json.dumps(payload),headers=headers).json()
#print(jsonData)

# The first result set holds the job postings ("hits").
for item in jsonData['results'][0]['hits']:
    title = item['_highlightResult']['title']['value']
    company = item['_highlightResult']['company']['name']['value']
    skill = item['_highlightResult']['job_skills'][0]['name']['value']
    salary_max = item['salary_max']
    salary_min = item['salary_min']

    print(title)
    print(company)
    print(skill)
    print(salary_max)
    print(salary_min)

Output:

Corporate PR
Rocketindo
Sales Strategy & Management
12000000
7000000
Social Media Specialist
Rocketindo
Content Marketing
12000000
7000000
Performance Marketing Analyst (Mama's Choice)
The Parent Inc (theAsianparent)
Marketing Strategy
12000000
5000000
Business Development (Associate Consultant) - CRM
Mekari (PT. Mid Solusi Nusantara)
Business Development & Partnerships
7000000
5000000
Account Payable
Ritase
Corporate Finance
0
0
Data Engineer
Topremit
Databases
0
0
Public Relation KOL
Rocketindo
Business Development & Partnerships
7000000
5000000
Graphic Designer
Rocketindo
Adobe Illustrator
12000000
7000000
Yogyakarta City Coordinator
Deliveree Indonesia
Business Operations
6000000
5250000
Marketing Manager
Deliveree Indonesia
Marketing Strategy
0
0
Graphic Designer
Deliveree Indonesia
Graphic Design
6000000
5250000
Quality Assurance
PT Rekeningku Dotcom Indonesia
Javascript
10000000
4500000
Internship Program
TADA
Attention to Detail
3700000
3000000
Product Management Support
Hangry
Data Warehouse
0
0
Content Writer
Bobobox Indonesia
Copywriting
0
0
UX Researcher
Bobobox Indonesia
UI/UX Design
0
0
UX Copywriter
Bobobox Indonesia
Problem Solving
0
0
Internship HR (Recruitment)
PT Formasi Agung Selaras (Famous Allstars)
Human Resources
1500000
1000000
Fullstack Developer - Banking Industry
SIGMATECH
React.js
12000000
8000000
REACT NATIVE DEVELOPER
BGT Solution
MySQL
16000000
6000000
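
The loop above indexes straight into each hit, so a posting with no skills or a missing field would raise an exception partway through. Below is a minimal sketch of a more tolerant extraction. The helper name summarize_hit and the sample hits are hypothetical: they only mimic the key layout used in the code above, and they are not real API responses.

```python
def summarize_hit(item):
    """Pull a few fields from one hit, tolerating missing keys and empty skill lists."""
    highlight = item.get("_highlightResult", {})
    skills = highlight.get("job_skills") or []
    first_skill = skills[0].get("name", {}).get("value") if skills else None
    return {
        "title": highlight.get("title", {}).get("value"),
        "company": highlight.get("company", {}).get("name", {}).get("value"),
        "skill": first_skill,
        "salary_min": item.get("salary_min"),
        "salary_max": item.get("salary_max"),
    }


# Hypothetical hits shaped like the response used above.
sample_hits = [
    {
        "_highlightResult": {
            "title": {"value": "Data Engineer"},
            "company": {"name": {"value": "Topremit"}},
            "job_skills": [{"name": {"value": "Databases"}}],
        },
        "salary_min": 0,
        "salary_max": 0,
    },
    # A hit with no company, no skills and no salary_max.
    {
        "_highlightResult": {"title": {"value": "Marketing Manager"}, "job_skills": []},
        "salary_min": 0,
    },
]

rows = [summarize_hit(hit) for hit in sample_hits]
for row in rows:
    print(row)
```

Missing values come back as None instead of raising, and the resulting list of dicts can be passed to pd.DataFrame(rows) if a table is needed.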