Why do I get an empty list in Scrapy when I use response.css

Time:10-16

The code I used was

import scrapy


class JobSpider(scrapy.Spider):
    name = 'job'

    start_urls = [
        'https://jobs.goodlifefitness.com/listjobs/'
    ]

In the Scrapy shell, I ran the following selector to get the links:

response.css('div.jobTitle a::attr(href)')

and I got an empty list: []

CodePudding user response:

This happens because almost the entire page is rendered by JavaScript. If you save the HTML that the request returns to a local file and open it, you will see that about 99% of it is <script> tags, so the div.jobTitle elements do not exist yet when Scrapy parses the response. Fortunately, pages like this are easy to scrape with the requests-html library (not to be confused with the requests library), which can render the JavaScript before you query the page.

For example:

pip install requests-html

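from requests_html import HTMLSession
import json

session = HTMLSession()
full = []
# Walk the first five result pages
for i in range(1, 6):
    r = session.get(f"https://jobs.goodlifefitness.com/listjobs/?pg={i}")
    # Run the page's JavaScript (downloads Chromium on first use)
    r.html.render()
    lst = r.html.xpath("//div[@class='jobTitle']/a/@href")
    # Accumulate the links from every page instead of overwriting them
    full += lst

with open("links.json", "wt") as f:
    json.dump(full, f)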

OUTPUT

['/job/16881922/customer-service-representative-motivator-prince-george-river-oint-landing-prince-george-ca/', '/job/16881921/club-attendant-winnipeg-grant-ark-shopping-centre-winnipeg-ca/', '/job/16881919/sales-fitness-advisor-north-york-dufferin-and-finch-north-york-ca/', '/job/16881920/club-attendant-north-york-dufferin-and-finch-north-york-ca/', '/job/16881918/personal-trainer-regina-victoria-square-regina-ca/', '/job/16878045/customer-service-representative-motivator-mississauga-heartland-town-centre-mississauga-ca/', '/job/16878044/club-attendant-brampton-kingspoint-plaza-brampton-ca/', '/job/16878043/sales-fitness-advisor-vaughan-milani-and-highway-27-vaughan-ca/', '/job/16878042/sales-fitness-advisor-calgary-richmond-square-calgary-ca/', '/job/16878041/sales-fitness-advisor-toronto-yonge-and-st-clair-toronto-ca/', '/job/16878040/customer-service-representative-motivator-burlington-appleby-crossing-burlington-ca/', '/job/16878039/personal-trainer-north-york-yonge-and-finch-north-york-ca/', '/job/16873434/sales-and-service-representative-fitness-coach-whitby-taunton-and-brock-for-women-whitby-ca/', '/job/16873435/senior-fitness-coach-whitby-taunton-and-brock-for-women-whitby-ca/', '/job/16873433/club-attendant-brampton-mclaughlin-corners-west-brampton-ca/', '/job/16870781/personal-trainer-windsor-tecumseh-mall-windsor-ca/', '/job/16870780/fit4less-host-saskatoon-circle-west-plaza-saskatoon-ca/', '/job/16866062/service-technician-facility-kitchener-kitchener-ca/', '/job/16866061/service-technician-facility-mississauga-mississauga-ca/', '/job/16866060/sales-fitness-advisor-edmonton-rabbit-hill-road-edmonton-ca/', '/job/16866059/customer-service-representative-motivator-hamilton-queenston-place-hamilton-ca/', '/job/16866058/fit4less-host-markham-cochrane-markham-ca/', '/job/16866057/director-of-digital-marketing-remote-in-canada-london-ca/', '/job/16863233/group-fitness-instructor-bodycombat-edmonton-edmonton-ca/', 
'/job/16863232/group-fitness-instructor-bodypump-edmonton-edmonton-ca/', '/job/16863231/group-fitness-instructor-bodyattack-edmonton-edmonton-ca/', '/job/16863230/group-fitness-instructor-bodystep-edmonton-edmonton-ca/', '/job/16863228/fit4less-host-north-york-centerpoint-mall-north-york-ca/', '/job/16863227/fit4less-host-oakville-hyde-park-gate-oakville-ca/', '/job/16863226/fitness-manager-kitchener-fairway-plaza-kitchener-ca/', ...
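
If you want to check for yourself that the links are missing from the raw HTML, here is a minimal sketch that uses only the standard library. The two HTML strings are made-up stand-ins, not the real page, and find_job_links is a hypothetical helper name; in practice you would pass it response.text from the Scrapy shell.

```python
from html.parser import HTMLParser


class _JobLinkParser(HTMLParser):
    """Collects hrefs of <a> tags inside <div class="jobTitle"> and counts <script> tags."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0      # > 0 while we are inside a jobTitle div
        self.links = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self.script_count += 1
        elif tag == "div":
            classes = (attrs.get("class") or "").split()
            # Track nesting so that inner divs do not end the match early
            if self.div_depth or "jobTitle" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def find_job_links(html):
    """Return (job links, number of <script> tags) for an HTML string."""
    parser = _JobLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.script_count


# Made-up example of what the server sends before any JavaScript runs
raw_html = '<html><body><div id="app"></div><script src="app.js"></script><script>renderJobs();</script></body></html>'
# Made-up example of the same page after a browser has rendered it
rendered_html = '<html><body><div class="jobTitle"><a href="/job/1/example/">Example</a></div></body></html>'

print(find_job_links(raw_html))       # ([], 2)
print(find_job_links(rendered_html))  # (['/job/1/example/'], 0)
```

An empty link list together with a non-zero script count is a strong hint that the content is filled in by JavaScript after the page loads.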

CodePudding user response:

I would strongly suggest taking a look at their backend API. You can find it with the Chrome DevTools or a proxy.

This lets you scrape more data with a single request. Most of the time, backend APIs return JSON objects, which are much nicer to work with than digging the data out of the HTML.

I found the backend API for your specific case and wrote a small script that hopefully does what you want. As you can see, it scraped 408 data points with one request.

import requests
import json

url = "https://jobsapi-internal.m-cloud.io/api/job?callback=jobsCallback&sortfield=open_date&sortorder=descending&Limit=408&Organization=2239&offset=1"

r = requests.get(url).text

# The response is JSONP: jobsCallback({...}). Strip the wrapper so that
# only the JSON remains. Not pretty, but I was too lazy.
r = r.replace("jobsCallback(", "")
r = r.replace("}]})", "}]}")
json_obj = json.loads(r)

# Collect the title of every job
output = [job["title"] for job in json_obj["queryResult"]]

# Number of jobs scraped
print(len(output))
# The fields available for each job
print(json_obj["queryResult"][0].keys())
# All the job titles as a list
print(output)

Output:

408
dict_keys(['company_name', 'clientid', 'id', 'xc_id', 'sf_id', 'entity_status', 'scout_orgid', 'scout_userid', 'scout_teamid', 'language', 'ats_portalid', 'industry', 'function', 'title', 'ref', 'primary_city', 'primary_state', 'primary_zip', 'primary_country', 'primary_address', 'primary_location', 'addtnl_locations', 'description', 'primary_category', 'addtnl_categories', 'salary', 'job_type', 'travel', 'level', 'relocation', 'education', 'years_experience', 'open_positions', 'brand', 'department', 'shift', 'recruiter', 'parent_category', 'sub_category', 'business_unit', 'is_internal', 'employment_type', 'schedule', 'compliment', 'store_id', 'close_date', 'open_date', 'fndly_url', 'url', 'seo_url', 'location_type', 'importance', 'is_child_job', 'campaign_id', 'campaign_name', 'publish_to_cws', 'hidden', 'job_classifications', 'easy_apply', 'internal_url', 'internal_description', 'multi_select1', 'multi_select2', 'erp_eligible', 'erp_bonus', 'update_date'])
['Customer Service Representative (Motivator) - Prince George River Point Landing', 'Club Attendant - Winnipeg Grant Park Shopping Centre', 'Sales (Fitness Advisor) - North York Dufferin and Finch', 'Club Attendant - North York Dufferin and Finch', 'Personal Trainer - Regina Victoria Square', 'Customer Service Representative (Motivator) - Mississauga Heartland Town Centre', 'Club Attendant - Brampton Kingspoint Plaza', 'Sales (Fitness Advisor) - Vaughan Milani and Highway 27', 'Sales (Fitness Advisor) - Calgary Richmond Square', 'Sales (Fitness Advisor) - Toronto Yonge and St Clair', 'Customer Service Representative (Motivator) - Burlington Appleby Crossing', 'Personal Trainer - North York Yonge and Finch', 'Sales and Service Representative (Fitness Coach) - Whitby Taunton and Brock For Women', 'Senior Fitness Coach - Whitby Taunton and Brock For Women', 'Club Attendant - Brampton McLaughlin Corners West', 'Personal Trainer - Windsor Tecumseh Mall', 'Fit4Less Host - Saskatoon Circle West Plaza', ...]
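
The string replacements in the script above work for this particular response, but they depend on how the payload happens to end. A slightly more general way to strip a JSONP wrapper is sketched below. unwrap_jsonp is a hypothetical helper, and the sample string is made up to mirror the shape of the response (a queryResult list of jobs); it is not real data.

```python
import json
import re

# Matches  callbackName( ... )  with an optional trailing semicolon
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Parse a JSONP response such as 'jobsCallback({...})' into Python objects."""
    match = _JSONP_RE.match(text)
    if match is None:
        # Not wrapped in a callback: assume it is already plain JSON
        return json.loads(text)
    return json.loads(match.group(1))


# Made-up sample with the same overall shape as the API response
sample = 'jobsCallback({"queryResult": [{"title": "Example Job A"}, {"title": "Example Job B"}]})'
data = unwrap_jsonp(sample)
titles = [job["title"] for job in data["queryResult"]]
print(titles)  # ['Example Job A', 'Example Job B']
```

With requests, you would call unwrap_jsonp(requests.get(url).text) in place of the two replace calls.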