Following pagination next page links is not working with scrapy. My CSS selector is not selecting the href


I'm trying to scrape https://www.realtor.com/ to get rental information. I want to scrape all pages.

I have been continuously having this problem of not being able to follow the href to the next page using scrapy. I think my problem is that I'm not actually selecting the href of the required a element. Here's my code:

import scrapy

class RealtorScrape(scrapy.Spider):
    name = 'realtor'
    allowed_domains = ['realtor.com']
    start_urls = ['https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/']
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
    def parse(self, response):      
        for house in response.css('li.jsx-1881802087 div.jsx-2775064451'):
            house_info = house.css('div.jsx-11645185.card-box')
            status =  house_info.css('div.jsx-11645185 div.jsx-3853574337 div.jsx-3853574337 span.jsx-3853574337::text').get()
            if status == 'For Sale':
                yield {
                    'Status': house_info.css('div.jsx-11645185 div.jsx-3853574337 div.jsx-3853574337 span.jsx-3853574337::text').get(),
                    'Price': house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .srp-page-price span::text').get(),
                    'Beds': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-beds] span::text').getall()),
                    'Baths': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-baths] span::text').getall()),
                    'Square_feet': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-sqft] span::text').getall()),
                    'Accre_lot': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-sqftlot] span::text').getall()),
                    'Location': house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .card-bottom div[data-label=pc-address]::text').get()
                    
                }
        next_page = response.css('div.jsx-1709448077.pagination-wrapper div.styles__StyledPaginator-rui__sc-1vqyfdo-0 a[aria-label=Go to next page]').attrib['href']
        if next_page:
            yield response.follow(next_page, callback=self.parse)

I would like to understand the real reason why I'm not able to follow the link and keep scraping the rest of the pages. Is my problem with the CSS selector I have used to find the next_page href? Is response able to select any element on the page, or is it limited?

I want to understand the basic reason why this is failing so that I don't keep making the same mistake over and over again.

Thanks in advance for your support.

CodePudding user response:

  1. Your next-page selector selects nothing. I tested it; it is also longer than it needs to be, and you can shorten it considerably.

  2. Both of the following CSS and XPath expressions select the right element, but the spider still scrapes only 28 items and sometimes throws response status errors. The main question is why pagination doesn't work properly even though the element selection is correct. In fact, there is no general rule for building pagination; the most likely reason here is that the requested URL is produced by a search on fixed keywords. The two expressions are:

response.css('a[aria-label="Go to next page"]').attrib['href']
response.xpath('//*[contains(text(),"Next")]/@href').get()

Each selects the correct element, but neither worked for the entire pagination (a sketch of how they would be wired into the callback follows).
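
For context, this is roughly how either expression would plug into a callback. It is only a minimal sketch (the class and spider names below are made up, and the item extraction is elided), and when tested against this site it still stopped early, as described above:

import scrapy

class NextPageSketch(scrapy.Spider):
    # Hypothetical name; only the pagination logic is shown here.
    name = 'realtor_next_page_sketch'
    start_urls = ['https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/']

    def parse(self, response):
        # ... item extraction as in the question ...

        # ::attr(href) plus .get() returns the href string, or None when the
        # "Go to next page" link is missing, so the spider stops cleanly.
        next_page = response.css('a[aria-label="Go to next page"]::attr(href)').get()
        # next_page = response.xpath('//*[contains(text(),"Next")]/@href').get()
        if next_page is not None:
            # response.follow resolves relative hrefs against the current page.
            yield response.follow(next_page, callback=self.parse)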

  3. There is also an alternative way to handle pagination. Since the total number of pages is known, you can build the page URLs with the range function. Keep in mind that this kind of pagination was roughly twice as fast here, as well as more accurate and more robust.

Code:

import scrapy

class RealtorScrape(scrapy.Spider):
    name = 'realtor'
  
    # Build all 41 result-page URLs up front (pg-1 ... pg-41) instead of following "next" links.
    start_urls = [f'https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-{x}' for x in range(1,42)]
    #user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
    def parse(self, response):      
        for house in response.css('li.jsx-1881802087 div.jsx-2775064451'):
            house_info = house.css('div.jsx-11645185.card-box')
            status =  house_info.css('div.jsx-11645185 div.jsx-3853574337 div.jsx-3853574337 span.jsx-3853574337::text').get()
            if status == 'For Sale':
                yield {
                    'Status': house_info.css('div.jsx-11645185 div.jsx-3853574337 div.jsx-3853574337 span.jsx-3853574337::text').get(),
                    'Price': house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .srp-page-price span::text').get(),
                    'Beds': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-beds] span::text').getall()),
                    'Baths': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-baths] span::text').getall()),
                    'Square_feet': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-sqft] span::text').getall()),
                    'Accre_lot': ' '.join(house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .prop-meta .srp_listMeta .property-meta-srpPage li[data-label=pc-meta-sqftlot] span::text').getall()),
                    'Location': house_info.css('div.jsx-11645185.detail-wrap .summary-wrap .property-wrap .card-bottom div[data-label=pc-address]::text').get()
                    
                }
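
Assuming the spider above is saved as realtor.py (the file and output names here are just examples), it can be run and the items exported in one go with the standard Scrapy CLI:

scrapy runspider realtor.py -o results.json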

Output:

{'Status': 'For Sale', 'Price': '$789,900', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '6,970 sqft lot', 'Location': '2208 Irving Ave S'}
2022-10-30 04:40:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-40>
{'Status': 'For Sale', 'Price': '$189,900', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '0.24 acre lot', 'Location': '3402 Wilshire Pl NE'}
2022-10-30 04:40:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41> (referer: None)
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$269,900', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '6,098 sqft lot', 'Location': '4229 France Ave S'}
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$124,900', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '6,098 sqft lot', 'Location': '3040 Taylor St NE'}
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$189,900', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '8,276 sqft lot', 'Location': '3410 Wilshire Pl NE'}
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$149,000', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '0.28 acre lot', 'Location': '4425 Aldrich Ave N'}
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$42,000', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '5,227 sqft lot', 'Location': '1519 Oliver Ave N'}
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$349,900', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '0.24 acre lot', 'Location': '3403 38th Ave S'}
2022-10-30 04:40:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realtor.com/realestateandhomes-search/Minneapolis_MN/pg-41>
{'Status': 'For Sale', 'Price': '$39,777', 'Beds': '', 'Baths': '', 'Square_feet': '', 'Accre_lot': '5,227 sqft lot', 'Location': '2955 Russell Ave N'}
2022-10-30 04:40:55 [scrapy.core.engine] INFO: Closing spider (finished)
2022-10-30 04:40:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 17524,
 'downloader/request_count': 41,
 'downloader/request_method_count/GET': 41,
 'downloader/response_bytes': 6755662,
 'downloader/response_count': 41,
 'downloader/response_status_count/200': 41,
 'elapsed_time_seconds': 247.5606,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 10, 29, 22, 40, 55, 926292),
 'httpcompression/response_bytes': 45965645,
 'httpcompression/response_count': 41,
 'item_scraped_count': 1134,

CodePudding user response:

The reason your selector is not working is that attribute values containing spaces must be quoted in CSS attribute selectors.

Your next_page selector:

next_page = response.css('div.jsx-1709448077.pagination-wrapper div.styles__StyledPaginator-rui__sc-1vqyfdo-0 a[aria-label=Go to next page]').attrib['href']

All you need to do to make your selector work is to add quotes around Go to next page.

for example:

next_page = response.css('div.jsx-1709448077.pagination-wrapper div.styles__StyledPaginator-rui__sc-1vqyfdo-0 a[aria-label="Go to next page"]').attrib['href']

That is the "Real reason" why your selector was not working.
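
As a side note (separate from the fix above), on Scrapy 2.0+ you don't have to pull the href out by hand at all; this is just a sketch of that alternative pattern for the end of parse:

    def parse(self, response):
        # ... item extraction as in the question ...

        # follow_all accepts the <a> selectors directly and resolves their
        # href attributes; an empty selection yields no requests, so there is
        # no KeyError from .attrib['href'] on the last page.
        yield from response.follow_all(
            response.css('a[aria-label="Go to next page"]'),
            callback=self.parse,
        )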
