How can I handle pagination with Scrapy and Splash, if the href of the button is javascript:void(0)

Time:05-13

I am trying to scrape the names and links of universities from this website: https://www.topuniversities.com/university-rankings/world-university-rankings/2021. I ran into a problem with pagination: the href of the button that leads to the next page is javascript:void(0), so I cannot reach the next page with scrapy.Request() or response.follow(). Is there any way to handle pagination like this?

[Screenshot of the website]

[Screenshot of the tag and href]

The URL of this page has no query parameters, and it stays the same when the next-page button is clicked, so I cannot handle pagination by altering the URL.

The code snippet below only fetches the names and links of the universities on the first and second pages:

import scrapy
from scrapy_splash import SplashRequest


class UniSpider(scrapy.Spider):
    name = 'uni'
    allowed_domains = ['www.topuniversities.com']

    script = """
    function main(splash, args)
      splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
      splash.private_mode_enabled = false
      assert(splash:go(args.url))
      assert(splash:wait(3))

      return {
        html = splash:html()
      }
    end
    """

    next_page = """
    function main(splash, args)
        splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
        splash.private_mode_enabled = false
        assert(splash:go(args.url))
        assert(splash:wait(3))

        local btn = assert(splash:jsfunc([[
          function() {
            document.querySelector("#alt-style-pagination a.page-link.next").click()
          }
        ]]))
        assert(splash:wait(2))
        btn()

        splash:set_viewport_full()
        assert(splash:wait(3))

        return {
          html = splash:html()
        }
    end
    """

    def start_requests(self):
        yield SplashRequest(
            url="https://www.topuniversities.com/university-rankings/world-university-rankings/2021",
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.script})

    def parse(self, response):
        for uni in response.css("a.uni-link"):
            uni_link = response.urljoin(uni.css("::attr(href)").get())
            yield {
                "name": uni.css("::text").get(),
                "link": uni_link
            }

        yield SplashRequest(
            url=response.url,
            callback=self.parse, endpoint="execute",
            args={"lua_source": self.next_page}
        )

Answer:

You don't need Splash for this simple website.

Try loading the following link instead:

https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt

This file contains all the universities. The website loads this JSON file only once and then displays the data page by page in the browser, which is why the URL never changes when you click the next-page button.

Here is a short script that does this (using requests and lxml rather than Scrapy):

import requests
from lxml.html import fromstring

url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt"
response = requests.get(url)
response.raise_for_status()  # fail early on an HTTP error

# The file has a .txt extension but contains JSON.
# Alternative way to load it: json.loads(response.content.decode())
jdata = response.json()
for x in jdata['data']:
    # Fields available on each record
    core_id = x['core_id']
    country = x['country']
    city = x['city']
    guide = x['guide']
    nid = x['nid']
    title = x['title']
    logo = x['logo']
    score = x['score']
    rank_display = x['rank_display']
    region = x['region']
    stars = x['stars']
    recm = x['recm']
    dagger = x['dagger']

    # 'title' is an HTML fragment; extract the text of its <a> element
    fragment = fromstring(title)
    title = fragment.xpath(".//a/text()")[0]

    print(title)

The code above prints the title of each university. Try saving it to a CSV/Excel file along with the other available columns. The output looks like:

Massachusetts Institute of Technology (MIT) 
Stanford University
Harvard University
California Institute of Technology (Caltech)
University of Oxford
ETH Zurich - Swiss Federal Institute of Technology
University of Cambridge
Imperial College London
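To follow up on the CSV suggestion, here is a minimal sketch that uses only the standard library. The helper names `name_and_link` and `write_csv` are made up for illustration, the extra columns are a few of the keys read in the script above, and the code assumes each `title` fragment holds an `<a>` element whose `href` is relative to https://www.topuniversities.com.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.topuniversities.com"  # assumed base for relative hrefs


class _AnchorParser(HTMLParser):
    """Collects the href and text of the first <a> element in a fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self._in_anchor = False
        self._done = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._done and not self._in_anchor:
            self._in_anchor = True
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor = False
            self._done = True  # ignore any later anchors

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    @property
    def text(self):
        return "".join(self._text).strip()


def name_and_link(title_html):
    """Return (name, absolute link) from a record's 'title' fragment."""
    parser = _AnchorParser()
    parser.feed(title_html)
    parser.close()
    link = urljoin(BASE_URL, parser.href) if parser.href else ""
    return parser.text, link


def write_csv(records, fileobj,
              extra_columns=("rank_display", "country", "city", "score")):
    """Write one CSV row per record: name, link, then the extra columns."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "link", *extra_columns])
    for record in records:
        name, link = name_and_link(record["title"])
        writer.writerow(
            [name, link, *(record.get(col, "") for col in extra_columns)]
        )
```

With the `jdata` dictionary from the script above, it could be called as `write_csv(jdata["data"], f)` inside `with open("universities.csv", "w", newline="", encoding="utf-8") as f:`. This also recovers the link for each university, which the question asked for alongside the name.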