I am trying to scrape the names and links of universities from this website: https://www.topuniversities.com/university-rankings/world-university-rankings/2021. I ran into a problem with pagination: the href of the button that leads to the next page is javascript:void(0), so I cannot reach the next page with scrapy.Request() or response.follow(). Is there any way to handle pagination like this?
[Screenshot of the next-page button's tag and its href]
The URL of this website contains no parameters, and it does not change when the next-page button is clicked, so I cannot handle pagination by altering the URL either.
The code snippet below only fetches the names and links of the universities on the first and second pages:
import scrapy
from scrapy_splash import SplashRequest


class UniSpider(scrapy.Spider):
    name = 'uni'
    allowed_domains = ['www.topuniversities.com']

    # Renders the first page and returns its HTML.
    script = """
    function main(splash, args)
        splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
        splash.private_mode_enabled = false
        assert(splash:go(args.url))
        assert(splash:wait(3))
        return {
            html = splash:html()
        }
    end
    """

    # Re-renders the page, clicks the next-page button, and returns the HTML.
    next_page = """
    function main(splash, args)
        splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
        splash.private_mode_enabled = false
        assert(splash:go(args.url))
        assert(splash:wait(3))
        local btn = assert(splash:jsfunc([[
            function(){
                document.querySelector("#alt-style-pagination a.page-link.next").click()
            }
        ]]))
        assert(splash:wait(2))
        btn()
        splash:set_viewport_full()
        assert(splash:wait(3))
        return {
            html = splash:html()
        }
    end
    """

    def start_requests(self):
        yield SplashRequest(
            url="https://www.topuniversities.com/university-rankings/world-university-rankings/2021",
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.script})

    def parse(self, response):
        # Extract the name and absolute link of every university on the page.
        for uni in response.css("a.uni-link"):
            uni_link = response.urljoin(uni.css("::attr(href)").get())
            yield {
                "name": uni.css("::text").get(),
                "link": uni_link
            }
        # Request the same URL again with the next-page script.
        yield SplashRequest(
            url=response.url,
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.next_page}
        )
CodePudding user response:
You don't need Splash for this simple website.
Try loading the following link instead:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt
It contains all the universities; the website loads this JSON file only once and then paginates it client-side.
Here is a short version (not using scrapy):
from requests import get
from lxml.html import fromstring

url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
html = get(url, stream=True)

## another approach for loading the json:
# from json import loads
# jdata = loads(html.content.decode())
jdata = html.json()

for x in jdata['data']:
    core_id = x['core_id']
    country = x['country']
    city = x['city']
    guide = x['guide']
    nid = x['nid']
    title = x['title']
    logo = x['logo']
    score = x['score']
    rank_display = x['rank_display']
    region = x['region']
    stars = x['stars']
    recm = x['recm']
    dagger = x['dagger']

    ## 'title' is an HTML fragment; extract the anchor text
    soup = fromstring(title)
    title = soup.xpath(".//a/text()")[0]
    print(title)
The code above prints the 'title' of each university; try saving it to a CSV/Excel file along with the other available columns (a sketch follows the sample output below). The result looks like:
Massachusetts Institute of Technology (MIT)
Stanford University
Harvard University
California Institute of Technology (Caltech)
University of Oxford
ETH Zurich - Swiss Federal Institute of Technology
University of Cambridge
Imperial College London
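
As a minimal sketch of that CSV step, here is one way to combine the fields above with the standard csv module. It assumes the field names used in the answer (rank_display, country, city, score) and that the <a> inside the 'title' fragment carries the relative university link, mirroring the a.uni-link hrefs the question targets; verify both against the live JSON before relying on them.

import csv
from urllib.parse import urljoin

from requests import get
from lxml.html import fromstring

url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
jdata = get(url).json()

with open("universities.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "name", "link", "country", "city", "score"])
    for x in jdata["data"]:
        # 'title' is an HTML fragment such as <a href="...">Name</a>
        soup = fromstring(x["title"])
        name = soup.xpath(".//a/text()")[0]
        href = soup.xpath(".//a/@href")[0]  # assumed to be a relative link
        writer.writerow([
            x["rank_display"],
            name,
            urljoin("https://www.topuniversities.com", href),
            x["country"], x["city"], x["score"],
        ])

If you would rather stay inside scrapy, yielding these fields as items and running scrapy crawl uni -o unis.csv produces the same CSV through scrapy's feed exports.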