Having trouble getting next page in Scrapy

Time:11-08

I am learning Scrapy and building a simple crawler to reinforce what I am learning. I am trying to get the next-page link but am having trouble. Can anyone point me in the right direction? The link is located in the <a> of the final <li>.

The pagination div is as follows:

<div >
    <ul>
        <li><a href="./viewforum.php?f=399&amp;start=40" data-original-title="" title=""><i
                ></i></a></li>
        <li><a href="./viewforum.php?f=399" data-original-title="" title="">1</a></li>
        <span >, </span>
        <li><a href="./viewforum.php?f=399&amp;start=40" data-original-title="" title="">2</a></li>
        <span >, </span>
        <li ><a data-original-title="" title="">3</a></li>
        <span >, </span>
        <li><a href="./viewforum.php?f=399&amp;start=120" data-original-title="" title="">4</a></li>
        <span >, </span>
        <li><a href="./viewforum.php?f=399&amp;start=160" data-original-title="" title="">5</a></li>
        <span >, </span>
        <li><a href="./viewforum.php?f=399&amp;start=200" data-original-title="" title="">6</a></li>
        <li ><a  href="#" onclick="jumpto(); return false;" title=""
                              data-original-title="Jump to page"> ... </a></li>
        <li><a href="./viewforum.php?f=399&amp;start=311244" data-original-title="" title="">10012</a></li>
        <li><a href="./viewforum.php?f=399&amp;start=120" data-original-title="" title=""><i
                ></i></a></li>
    </ul>
</div>

I have tried different variations of the following, but the wrong li is returned. It still gives me the class=active li even though I used li:not([]): response.css('div.pagination.pagination-small.hidden-phone').css('li:not([])').get()

example:

>>> response.css('div.pagination.pagination-small.hidden-phone').css('li:not([])').get()
'<li ><a>1</a></li>'

Thanks

CodePudding user response:

Since it's the last li in the list, we can use this to our advantage.

css:

In [1]: response.css('div.pagination li:last-child a::attr(href)').get()
Out[1]: './viewforum.php?f=399&start=120'

xpath:

In [2]: response.xpath('//div[contains(@class, "pagination")]//li[last()]/a/@href').get()
Out[2]: './viewforum.php?f=399&start=120'