Next page issues with Scrapy in Python (JSON)

I'm trying to feed postcodes from a list into my spider (inside the class), but it isn't working properly. start_urls picks up sa1, sa2 and sa3 as expected, but only 'sa3' (the last one) reaches the parse method, so the next_page requests are built only for 'sa3'.

This is my code:

import scrapy
import json


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1', 'sa2', 'sa3')
    for postcode in postcodes:
        start_urls = [f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid']

        def parse(self, response):
            data = json.loads(response.body)
            properties = data.get('properties')
            for property in properties:
                yield {
                    'id': property.get('id'),
                    'price': property.get('price'),
                    'title': property.get('property-title'),
                    'url': response.urljoin(property.get('property-link'))
                }

            pages = int(100 / 23)
            postcode = self.postcode

            for number in range(1, pages + 1):
                next_page = f"https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&page={number}&sort-field=keywords&under-offer=true&view=grid"
                yield scrapy.Request(next_page, callback=self.parse)

I want to achieve this result, if possible.

This is start URL:  ['https://www.domainname-id=sa1&view=grid']
This is next page:  https://www.domainname-id=sa1&page=1&view=grid
This is next page:  https://www.domainname-id=sa1&page=2&view=grid
This is next page:  https://www.domainname-id=sa1&page=3&view=grid
This is start URL:  ['https://www.domainname-id=sa2&view=grid']
This is next page:  https://www.domainname-id=sa2&page=1&view=grid
This is next page:  https://www.domainname-id=sa2&page=2&view=grid
This is next page:  https://www.domainname-id=sa2&page=3&view=grid
This is start URL:  ['https://www.domainname-id=sa3&view=grid']
This is next page:  https://www.domainname-id=sa3&page=1&view=grid
This is next page:  https://www.domainname-id=sa3&page=2&view=grid
This is next page:  https://www.domainname-id=sa3&page=3&view=grid

Thanks for your time.

CodePudding user response:

You create the start_urls list and then overwrite it on every loop iteration, so you end up with only the last URL. Instead, you need to append to it:

start_urls = []

for postcode in postcodes:
    start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid')
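
For illustration (not part of the original answer), here is a minimal sketch of the rebinding behaviour: a loop in a class body runs once, at class-creation time, and every iteration rebinds the same class attributes, so only the final binding survives.

class Demo:
    values = ('a', 'b', 'c')
    for value in values:
        last = value  # rebinds the same class attribute on each pass

print(Demo.last)  # prints 'c' - only the last binding is kept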

EDIT:

The complete code:

import scrapy
import json


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1', 'sa2', 'sa3')
    start_urls = []

    for postcode in postcodes:
        start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid')

    def parse(self, response):
        data = json.loads(response.body)
        properties = data.get('properties')
        for property in properties:
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link'))
            }

        pages = int(100 / 23)
        postcode = self.postcode

        for number in range(1, pages + 1):
            next_page = f"https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&page={number}&sort-field=keywords&under-offer=true&view=grid"
            yield scrapy.Request(next_page, callback=self.parse)

EDIT 2: self.postcode is always 'sa3' by the time parse runs (the class-body loop left only the last value), so build the next pages from response.url instead; each response then paginates its own postcode:

import scrapy
import json


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1', 'sa2', 'sa3')
    start_urls = []

    for postcode in postcodes:
        start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid')

    def parse(self, response):
        data = json.loads(response.body)
        properties = data.get('properties')
        for property in properties:
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link'))
            }

        # pages = int(100 / 23)
        pages = 4   # int(100/23) = 4
        postcode = self.postcode    # always 'sa3'

        for number in range(1, pages + 1):
            next_page = f'{response.url}&page={number}'
            yield scrapy.Request(next_page, callback=self.parse)

EDIT 3: the URLs built in EDIT 2 keep appending &page=... to URLs that may already contain one, so start every postcode at page=1 and replace the page number with re.sub instead:

import scrapy
import json
import re


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1', 'sa2', 'sa3')
    start_urls = []

    for postcode in postcodes:
        start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid&page=1')

    def parse(self, response):
        data = json.loads(response.body)
        properties = data.get('properties')
        for property in properties:
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link'))
            }

        pages = 4   # int(100/23) = 4

        for number in range(1, pages + 1):
            next_page = re.sub(r'page=\d+', f'page={number}', response.url)
            yield scrapy.Request(next_page, callback=self.parse)
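
As a further variant (a sketch, not part of the original answer), Scrapy's cb_kwargs can carry each postcode and page number into the callback explicitly, so nothing depends on class attributes and each postcode chains its own pages:

import scrapy
import json


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']
    postcodes = ('sa1', 'sa2', 'sa3')
    pages = 4  # int(100/23) = 4, as above

    def start_requests(self):
        # One request per postcode; cb_kwargs hands the postcode and the
        # page number to parse, so no class attribute gets overwritten.
        for postcode in self.postcodes:
            url = (f'https://www.onthemarket.com/async/search/properties/'
                   f'?search-type=for-sale&location-id={postcode}'
                   f'&sort-field=keywords&under-offer=true&view=grid&page=1')
            yield scrapy.Request(url, callback=self.parse,
                                 cb_kwargs={'postcode': postcode, 'page': 1})

    def parse(self, response, postcode, page):
        data = json.loads(response.body)
        for property in data.get('properties', []):
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link')),
            }

        # Chain to the following page of the same postcode, up to the limit.
        if page < self.pages:
            next_page = (f'https://www.onthemarket.com/async/search/properties/'
                         f'?search-type=for-sale&location-id={postcode}'
                         f'&sort-field=keywords&under-offer=true&view=grid'
                         f'&page={page + 1}')
            yield scrapy.Request(next_page, callback=self.parse,
                                 cb_kwargs={'postcode': postcode,
                                            'page': page + 1})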