Spider closes without error messages and does not scrape all the pages in the pagination (SELENIUM)

Time:12-07

I have created a pipeline to place all the scraped data into a SQLite database, but my spider is not completing the pagination. This is what I get when the spider closes. I should get around 45k results and I am only getting 420. Why could this possibly be?

2021-12-06 14:47:55 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-06 14:47:55 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60891/session/d441b41f-b62b-4c64-a5ef-68329c18dd4e {}
2021-12-06 14:47:56 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60891 "DELETE /session/d441b41f-b62b-4c64-a5ef-68329c18dd4e HTTP/1.1" 200 14
2021-12-06 14:47:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-06 14:47:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 7510132,
 'downloader/response_count': 15,
 'downloader/response_status_count/200': 15,
 'elapsed_time_seconds': 89.469538,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 6, 20, 47, 55, 551566),
 'item_scraped_count': 420,
 'log_count/DEBUG': 577,
 'log_count/INFO': 11,
 'request_depth_max': 14,
 'response_received_count': 15,
 'scheduler/dequeued': 15,
 'scheduler/dequeued/memory': 15,
 'scheduler/enqueued': 15,
 'scheduler/enqueued/memory': 15,
 'start_time': datetime.datetime(2021, 12, 6, 20, 46, 26, 82028)}
2021-12-06 14:47:56 [scrapy.core.engine] INFO: Spider closed (finished)

And this is my spider:

import scrapy
from scrapy_selenium import SeleniumRequest


class HomesSpider(scrapy.Spider):
    name = 'homes'

    def remove_characters(self, value):
        return value.strip(' m²')

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/v1c1097l1021p1',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        homes = response.xpath("//div[@id='tileRedesign']/div")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//span[@class='ad-price']/text())").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                'bathrooms': home.xpath("//div[@class='chiplets-inline-block re-bathroom']/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': self.remove_characters(home.xpath("normalize-space(.//div[@class='chiplets-inline-block surface-area']/text())").get()),
                'link': home.xpath("//div[@class='tile-desc one-liner']/a/@href").get()
            }

        next_page = response.xpath("//a[@class='icon-pagination-right']/@href").get()
        if next_page:
            absolute_url = f"https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/v1c1097l1021p1{next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse,
                dont_filter=True
            )

Could this be related to my user agent? I have already set one in settings.py. Or am I being banned from this page? The HTML of the page does not change between pages either.

Thanks.

CodePudding user response:

Your code is working as expected; the problem was in the pagination portion. I moved the pagination into the start URLs, generating all page URLs up front, which is reliable and in my runs more than twice as fast as following a "next page" link. There are 50 pages, and the total scraped item count is 1400.

Script

import scrapy
from scrapy_selenium import SeleniumRequest


class HomesSpider(scrapy.Spider):
    name = 'homes'
    def remove_characters(self, value):
        return value.strip(' m²')

    def start_requests(self):
        urls = [f'https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-{i}/v1c1097l1021p50' for i in range(1, 51)]
        for url in urls:
            yield SeleniumRequest(
                url=url,
                wait_time=5,
                callback=self.parse
                )

    def parse(self, response):
        homes = response.xpath("//div[@id='tileRedesign']/div")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//span[@class='ad-price']/text())").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                'bathrooms': home.xpath("//div[@class='chiplets-inline-block re-bathroom']/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': self.remove_characters(home.xpath("normalize-space(.//div[@class='chiplets-inline-block surface-area']/text())").get()),
                'link': home.xpath("//div[@class='tile-desc one-liner']/a/@href").get()
            }
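A side note on the `remove_characters` helper used in both versions: `value.strip(' m²')` strips any of the characters `' '`, `'m'`, and `'²'` from both ends of the string, not the literal suffix `' m²'`. It happens to work for values like `'151 m²'`, but a suffix-aware version is safer (this sketch assumes Python 3.9+ for `str.removesuffix`):

```python
# str.strip(' m²') strips any of the characters ' ', 'm', '²' from
# both ends -- it is a character-set strip, not a suffix removal.
value = "151 m²"
stripped = value.strip(" m²")  # works here by coincidence

# Suffix-aware alternative (str.removesuffix requires Python 3.9+):
def remove_characters(value: str) -> str:
    return value.removesuffix(" m²").strip()

cleaned = remove_characters("151 m²")
```

Both return `'151'` for this input, but only the second one is guaranteed not to eat trailing characters that happen to be in the strip set.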

Output

{'price': '$3,520,664', 'location': 'Santiago de Querétaro', 'description': 'Paso de los Toros Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '151', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '$4,690,000', 'location': 'El Refugio', 'description': 'Riaño Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '224', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}      
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-rincones-marques/5d6951eee4b05e9aaae12de6'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '$4,690,000', 'location': 'El Refugio', 'description': 'Riaño Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '224', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}      
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-07 06:06:33 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:65206/session/1487a9ea1c9752794aad497613552337 {}
2021-12-07 06:06:33 [urllib3.connectionpool] DEBUG: http://127.0.0.1:65206 "DELETE /session/1487a9ea1c9752794aad497613552337 HTTP/1.1" 200 14
2021-12-07 06:06:33 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-07 06:06:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 23589849,
 'downloader/response_count': 50,
 'downloader/response_status_count/200': 50,
 'elapsed_time_seconds': 150.933428,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 7, 0, 6, 33, 111357),
 'item_scraped_count': 1400,

... and so on
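One thing visible in the output above: the `link` and `bathrooms` fields repeat the same value across items. That is because those two XPath expressions start with `//` instead of `.//`, so they search the whole page rather than the current `home` tile, and `.get()` always returns the first page-wide match. A minimal illustration of the scoping difference, using the stdlib `xml.etree.ElementTree` as a stand-in for Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Toy page: two result tiles, each with its own link.
page = ET.fromstring(
    "<page>"
    "<tile><a href='/listing-1'>Home 1</a></tile>"
    "<tile><a href='/listing-2'>Home 2</a></tile>"
    "</page>"
)

# A relative path ('.//a') is scoped to each tile, so every item
# gets its own link -- this is what './/div[...]' does in the spider.
links = [tile.find(".//a").get("href") for tile in page.findall("tile")]

# In a full XPath engine (lxml/parsel), a path starting with '//'
# ignores the context node and matches from the document root, so
# .get() would return the first page's link for every tile.
```

Changing `//div[@class='chiplets-inline-block re-bathroom']` and `//div[@class='tile-desc one-liner']/a/@href` to start with `.//` should give per-listing values.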
