I'm trying to scrape this website https://www.pararius.com/english to get rental information. I want to scrape all pages on this site.
I've looked through similar questions on Stack Overflow about Scrapy pagination issues, but none seem to reflect my problem.
Everything in my code works except the part where I follow the 'next_page' link. I have written another spider for a book website using exactly the same approach and it works perfectly. Here, I'm failing to join the next_page link to the start URL and have Scrapy automatically scrape the next page.
Here's my code:
import scrapy
from time import sleep


class ParariusScraper(scrapy.Spider):
    name = 'pararius'
    start_urls = ['https://www.pararius.com/apartments/amsterdam/']

    def parse(self, response):
        base_url = 'https://www.pararius.com/apartments/amsterdam'
        for section in response.css('section.listing-search-item'):
            yield {
                'Title': section.css('h2.listing-search-item__title > a::text').get().strip(),
                'Location': section.css('div.listing-search-item__sub-title::text').get().strip(),
                'Price': section.css('div.listing-search-item__price::text').get().strip(),
                'Size': section.css('li.illustrated-features__item::text').get().strip(),
                'Link': f"{base_url}{section.css('h2.listing-search-item__title a').attrib['href']}",
            }
            sleep(1)
            next_page = response.css('li.pagination__item a').attrib['href'].split('/')[-1]
            print(next_page)
            if next_page:
                yield response.follow(next_page, self.parse)
When I run this code, the strange thing that happens is that it only scrapes the page-2 results, not even the first page, which is the start_url in my code.
I would like to know how I can fix this and have my code work as expected. Thanks, and I hope to get your support.
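For reference, here is what that split('/')[-1] step produces in isolation, and how a bare segment like "page-2" resolves against a base URL (a standalone sketch using the standard library's urljoin, with an illustrative href; this is the same resolution response.follow performs):

```python
from urllib.parse import urljoin

# Hypothetical pagination href as it might appear in the page source.
href = "/apartments/amsterdam/page-2"
next_page = href.split("/")[-1]
print(next_page)  # page-2

# A bare "page-2" only resolves under the listing path when the base URL
# ends with a trailing slash; without one, the last segment is replaced.
print(urljoin("https://www.pararius.com/apartments/amsterdam/", next_page))
# https://www.pararius.com/apartments/amsterdam/page-2
print(urljoin("https://www.pararius.com/apartments/amsterdam", next_page))
# https://www.pararius.com/apartments/page-2
```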
CodePudding user response:
I managed to get it to work using the example below. There is an issue with your CSS selector for the next page, and it's much easier to use response.urljoin()
for relative links than to do all of the parsing yourself. You also need to dedent the request for the next page to outside the for loop; otherwise you will send an identical request on every iteration of the loop.
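To illustrate the urljoin point: Scrapy's response.urljoin() delegates to the standard library's urllib.parse.urljoin, resolving a relative href against the URL of the page it was found on. A minimal sketch with an illustrative href:

```python
from urllib.parse import urljoin

# A root-relative pagination href, as typically found in the markup.
page_url = "https://www.pararius.com/apartments/amsterdam/page-2"
relative_href = "/apartments/amsterdam/page-3"

# Root-relative hrefs (leading "/") replace the whole path of the base URL,
# so no manual string splitting or concatenation is needed.
absolute = urljoin(page_url, relative_href)
print(absolute)  # https://www.pararius.com/apartments/amsterdam/page-3
```

response.follow() accepts relative URLs directly, so in a spider the explicit join is optional; it just makes the resulting URL easy to inspect.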
import scrapy


class ParariusScraper(scrapy.Spider):
    name = 'pararius'
    start_urls = ['https://www.pararius.com/apartments/amsterdam/']

    def parse(self, response):
        for section in response.css('section.listing-search-item'):
            yield {
                'Title': section.css('h2.listing-search-item__title > a::text').get().strip(),
                'Location': section.css('div.listing-search-item__sub-title::text').get().strip(),
                'Price': section.css('div.listing-search-item__price::text').get().strip(),
                'Size': section.css('li.illustrated-features__item::text').get().strip(),
                'Link': response.urljoin(section.css('h2.listing-search-item__title a').attrib['href']),
            }
        # Use .get() and guard, so the last page (no "next" link) doesn't raise.
        next_page = response.css('.pagination__link.pagination__link--next::attr(href)').get()
        if next_page:
            yield response.follow(response.urljoin(next_page), self.parse)
CodePudding user response:
The pagination in the following code doesn't throw any exceptions:
import scrapy


class ParariusScraper(scrapy.Spider):
    name = 'pararius'
    start_urls = ['https://www.pararius.com/apartments/amsterdam/']

    def parse(self, response):
        for section in response.css('section.listing-search-item'):
            yield {
                'Title': section.css('h2.listing-search-item__title > a::text').get().strip(),
                'Location': section.css('div.listing-search-item__sub-title::text').get().strip(),
                'Price': section.css('div.listing-search-item__price::text').get().strip(),
                'Size': section.css('li.illustrated-features__item::text').get().strip(),
                'Link': f"{self.start_urls[0]}{section.css('h2.listing-search-item__title a').attrib['href']}",
            }
        next_page = response.css('a:contains(Next)::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
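Note that :contains() is a parsel/Scrapy extension, not standard CSS: it matches elements by their text content. The snippet below is a rough standard-library sketch of the same text-matching idea, run against a hypothetical HTML fragment:

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Collect the href of <a> tags whose text contains 'Next'."""

    def __init__(self):
        super().__init__()
        self._current_href = None  # href of the <a> we are inside, if any
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text seen while inside an <a> tag; match on the word "Next".
        if self._current_href and "Next" in data:
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


# Hypothetical pagination markup for illustration.
html = '<nav><a href="/page-1">Previous</a><a href="/page-3">Next</a></nav>'
finder = NextLinkFinder()
finder.feed(html)
print(finder.next_href)  # /page-3
```

Matching on link text like this is more resilient to class-name changes than the pagination__link selectors, which is presumably why this answer's selector keeps working.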