Convert relative URL in Scrapy Crawler Rule to absolute URL


I am trying to create a crawler with a rule that clicks into each property's page and scrapes the details. However, the links on the listing page are relative URLs, which the Scrapy crawler Rule doesn't seem to accept, as it only follows absolute URLs. Below is the workaround I came up with using process_value, but it doesn't work. Can anyone suggest another way to solve this? Thanks!

Here is my current code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EdgepropSpider(CrawlSpider):
    name = 'edgeprop'
    allowed_domains = ['edgeprop.my']
    start_urls = ['https://www.edgeprop.my/buy/malaysia/all-residential']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='card tep-listing-card']/a/@href"), process_value=lambda x: 'https://edgeprop.my' + x), callback='parse_item', follow=True),
        #Rule(LinkExtractor(restrict_xpaths=("//nav[@aria-label='Listing Page navigation']//li[position() = last()]/a")), follow=True)
    )

    def parse_item(self, response):
        yield {
            'Name': response.xpath("//div[@class='save-share']/following-sibling::h1/text()").get()
        }    
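For reference, process_value receives each URL string that the LinkExtractor pulls out, so the conversion the lambda above attempts can be sketched more safely with the standard library's urljoin, which, unlike plain string concatenation, handles leading slashes and already-absolute links (a sketch, not part of the original attempt):

from urllib.parse import urljoin

def make_absolute(href):
    # Resolves a relative href against the base URL;
    # an already-absolute URL passes through unchanged
    return urljoin('https://www.edgeprop.my', href)

print(make_absolute('/buy/some-listing'))
# https://www.edgeprop.my/buy/some-listing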

Here is the output:

2021-12-29 10:42:18 [scrapy.core.engine] INFO: Spider opened
2021-12-29 10:42:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-29 10:42:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-29 10:42:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.edgeprop.my/buy/malaysia/all-residential> (referer: None)
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-29 10:42:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 4126,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.237148,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 29, 2, 42, 18, 936521),
 'httpcompression/response_bytes': 10918,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 12, 29, 2, 42, 18, 699373)}
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Spider closed (finished)

CodePudding user response:

This is much easier than you think: instead of crawling the rendered pages, you can scrape the site's JSON API directly.

The payload also exposes the pagination and all the properties on each page. I have built a simple scraper that grabs all the responses; you'll just have to parse the dictionary. Mind you, you'll likely get redirected, so I've included a DOWNLOAD_DELAY, which may help. Everything else is self-explanatory.
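As a quick sanity check before running the full spider, you can hit the endpoint once and confirm it returns JSON with a 'property' list (a sketch using the requests library, which is not part of the answer's code; the endpoint and parameter names are taken from the spider below):

import requests

params = {
    'listing_type': 'sale',
    'state': 'Malaysia',
    'property_type': 'rl',
    'start': 0,
    'size': 20,
}
resp = requests.get(
    'https://www.edgeprop.my/jwdsonic/api/v1/property/search',
    params=params,
    headers={'User-Agent': 'Mozilla/5.0'},
)
listings = resp.json().get('property', [])
print(len(listings))  # up to 20 listings per page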

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst
from scrapy.item import Field
from scrapy.crawler import CrawlerProcess


class MalItem(scrapy.Item):
    listings = Field(output_processor=TakeFirst())


class MalSpider(scrapy.Spider):
    name = 'Mala'

    # Bare endpoint; the query parameters are supplied via formdata below,
    # which FormRequest appends to the URL on GET requests.
    start_urls = ['https://www.edgeprop.my/jwdsonic/api/v1/property/search']

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'CONCURRENT_REQUESTS': 100,
        'CONCURRENT_REQUESTS_PER_IP': 100,
        'DOWNLOAD_DELAY': 3
    }

    def start_requests(self):
        for url in self.start_urls:
            # 7709 appears to be the total number of listings; stepping by
            # the page size (20) avoids refetching overlapping windows.
            for offset in range(0, 7709, 20):
                yield scrapy.FormRequest(
                    url,
                    method='GET',
                    formdata={
                        'listing_type': 'sale',
                        'state': 'Malaysia',
                        'property_type': 'rl',
                        'start': str(offset),
                        'size': '20'
                    },
                    callback=self.parse
                )

    def parse(self, response):
        # Each response is a JSON document whose 'property' key holds the listings
        listings = response.json().get('property')
        for listing in listings:
            loader = ItemLoader(MalItem())
            loader.add_value('listings', listing)
            yield loader.load_item()


process = CrawlerProcess(
    settings={
        'FEED_URI': 'stuff.jl',
        'FEED_FORMAT': 'jsonlines'
    }
)
process.crawl(MalSpider)
process.start()
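Once the crawl finishes, each scraped item is written as one JSON object per line to stuff.jl. A minimal sketch of reading the feed back:

import json

with open('stuff.jl') as f:
    for line in f:
        item = json.loads(line)
        # each item holds the raw property dict under the 'listings' key
        print(item['listings'])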