I'm trying to scrape all 22 jobs on this webpage and then a bunch more from other companies that are using the same system to host their jobs.
I can get the first 10 jobs on the page, but the rest have to be loaded 10 at a time by clicking on a 'Show more' button. The URL doesn't change when you do that, and the only change I can see is that a token is added to the payload of the POST request.
[Image: request payload shown in the browser's Network tool]
I've tried following the answers to this Stack Exchange question and this one, but I still can't get it to work.
Here's my current code:
    def start_requests(self):
        url = 'https://apply.workable.com/api/v3/accounts/so-energy/jobs'
        headers = {'authority': 'https://apply.workable.com'}
        payload = {
            "token": "WzE2NjI2ODE2MDAwMDAsMjY0NTU4N10=",
            "query": "",
            "location": [],
            "department": [],
            "worktype": [],
            "remote": []}
        yield scrapy.Request(url=url,
                             method='POST',
                             headers=headers,
                             body=json.dumps(payload),
                             callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        print(data)
This gives me the first 10 jobs, but no more. I get exactly the same result if I remove the payload bits.
Any ideas?
(I'm very new to coding and this is my first question here, so apologies if I've missed something obvious but I've been trying to get this for hours. Thank you!)
Answer:
You need to read the nextPage value from each JSON response and send it as the token field in the payload of the next request. The first request is sent without a token.
from json import dumps

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    API_url = 'https://apply.workable.com/api/v3/accounts/so-energy/jobs'
    custom_settings = {'DOWNLOAD_DELAY': 0.6}

    # Payload for the first page; a 'token' key is added for later pages.
    payload = {
        "department": [],
        "location": [],
        "query": "",
        "remote": [],
        "worktype": []
    }

    # Headers copied from the browser request; Content-Type marks the body as JSON.
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "Host": "apply.workable.com",
        "Origin": "https://apply.workable.com",
        "Pragma": "no-cache",
        "Referer": "https://apply.workable.com/so-energy/",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Sec-GPC": "1",
        "TE": "trailers",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    }

    def start_requests(self):
        # First page: no token in the payload. The default callback is parse().
        yield scrapy.Request(url=self.API_url, headers=self.headers, body=dumps(self.payload), method="POST")

    def parse(self, response):
        # jobs
        data = response.json()
        for job in data['results']:
            yield {'job_details': job}

        # next page: stop when nextPage is missing or empty
        if data.get('nextPage'):
            self.payload['token'] = data['nextPage']
            yield scrapy.Request(url=self.API_url, headers=self.headers, body=dumps(self.payload), method="POST")
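
Since the question also mentions scraping other companies on the same system, here is a rough sketch of the same token loop pulled out of Scrapy so it can be checked without hitting the network. The names build_url, next_payload and collect_jobs are made up for illustration. fetch_page is a hypothetical callable that you would replace with a real POST request. It is also only an assumption that other accounts use the same URL pattern with a different slug in place of so-energy.

```python
import copy

# Assumption: other companies differ only in the account slug in this URL.
API_TEMPLATE = "https://apply.workable.com/api/v3/accounts/{account}/jobs"

# First-page payload, same fields as in the spider above, with no token.
BASE_PAYLOAD = {"query": "", "location": [], "department": [], "worktype": [], "remote": []}


def build_url(account):
    """Return the jobs endpoint for one account slug, e.g. 'so-energy'."""
    return API_TEMPLATE.format(account=account)


def next_payload(data):
    """Return the payload for the following page, or None if there is no nextPage."""
    token = data.get("nextPage")
    if not token:
        return None
    payload = copy.deepcopy(BASE_PAYLOAD)  # fresh copy, so no state is shared
    payload["token"] = token
    return payload


def collect_jobs(fetch_page):
    """Follow nextPage tokens until the response stops returning one.

    fetch_page (hypothetical) takes a payload dict and returns the decoded
    JSON response as a dict.
    """
    jobs = []
    payload = copy.deepcopy(BASE_PAYLOAD)
    while payload is not None:
        data = fetch_page(payload)
        jobs.extend(data.get("results", []))
        payload = next_payload(data)
    return jobs


if __name__ == "__main__":
    # Fake three-page API (10 + 10 + 2 jobs), used only to exercise the loop.
    pages = {
        None: {"results": list(range(10)), "nextPage": "t1"},
        "t1": {"results": list(range(10, 20)), "nextPage": "t2"},
        "t2": {"results": [20, 21]},
    }
    jobs = collect_jobs(lambda payload: pages[payload.get("token")])
    print(len(jobs))  # 22
```

Building a fresh payload for every page, instead of mutating one shared dict, means a token from one company can't leak into the first request for the next company if you loop over several account slugs.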