Scrapy Spider Trouble Navigating Through URLs


I have been struggling to find a way to approach this issue (the functions I show below do not work and are wrong, but it is more the overall process that I am confused about).

I am trying to have my spider get the prices for all of the products on the "standard-sheds" page. This is the link to the page which contains the products: https://www.charnleys.co.uk/product-category/gardening/garden-accessories/garden-furniture/sheds/standard-sheds/

However, if you click on a product link, you will see that the path changes to "charnleys.co.uk/shop/shed-product-name", so my spider can't follow it.

What I have thought about doing is collecting the URLs on the "standard-sheds" page, appending them to an array and iterating through it, then having my spider visit those URLs and collect the price. However, I am unsure how to get my spider to go through the array of URLs (see the sketch after the listing below). I will list the current functions I have created.

Any help is greatly appreciated.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

urls = []

class CharnleySpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ['charnleys.co.uk']
    start_urls = ['https://www.charnleys.co.uk']

    # Category page: https://www.charnleys.co.uk/product-category/gardening/garden-accessories/garden-furniture/sheds/standard-sheds/
    # Product page:  https://www.charnleys.co.uk/shop/bentley-supreme-apex/

    rules = (
        Rule(LinkExtractor(allow='product-category/gardening/garden-accessories/garden-furniture/sheds', deny='sheds')),
        Rule(LinkExtractor(allow='standard-sheds'), callback='collect_urls'),
    )

    def collect_urls(self, response):
        # Collect the product links from the category page into the module-level list.
        for element in response.css('div.product-image'):
            urls.append(element.css('a::attr(href)').get())

    def html_return_price_strings(self, response):
        # Search through the HTML of the page and return every token with "£" attached.
        all_html = response.css('html').get()
        prices = []
        for line in all_html.split('\n'):
            for word in line.split():
                if word.startswith('£'):
                    prices.append(word)
        return prices

    def parse_product(self, response):
        yield {
            'name': response.css('h2.product_title::text').get(),
            'price': self.html_return_price_strings(response),
        }
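For reference, the usual Scrapy pattern avoids a shared array entirely: a callback can yield new Request objects, and Scrapy schedules and follows them itself. A minimal sketch of that request-chaining pattern, reusing the question's div.product-image selector (an assumption about the page markup, not verified against the live site):

import scrapy

class ShedPricesSpider(scrapy.Spider):
    # Hypothetical standalone spider illustrating request chaining;
    # the name and selectors are assumptions for this sketch.
    name = 'shed_prices'
    allowed_domains = ['charnleys.co.uk']
    start_urls = [
        'https://www.charnleys.co.uk/product-category/gardening/'
        'garden-accessories/garden-furniture/sheds/standard-sheds/'
    ]

    def parse(self, response):
        # For every product link on the category page, schedule a new
        # request whose callback parses the product page.
        for href in response.css('div.product-image a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {
            'name': response.css('h2.product_title::text').get(),
            'url': response.url,
        }

Price extraction is deliberately left out of parse_product here, because, as the answer below explains, the prices on the product pages are rendered by JavaScript.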

CodePudding user response:

If you journey to each listing/details page and turn JavaScript off, you will notice that the price portion of the content disappears, meaning it is loaded dynamically by JavaScript. Scrapy can't render JS, but you can grab that dynamic content via scrapy-selenium's SeleniumRequest. Here I use Scrapy's default Spider, which is more robust for this than CrawlSpider.
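For SeleniumRequest to be honoured, scrapy-selenium's downloader middleware has to be enabled first. A minimal settings.py sketch, assuming Chrome with a chromedriver binary discoverable on PATH:

# settings.py -- minimal scrapy-selenium configuration
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run the browser without a window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}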

Code:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

class Test2Spider(scrapy.Spider):
    name = 'test2'
    start_urls = [
        f'https://www.charnleys.co.uk/product-category/gardening/garden-accessories/garden-furniture/sheds/standard-sheds/page/{x}/'
        for x in range(1, 3)
    ]

    def start_requests(self):
        # Plain requests are enough for the category pages; Selenium is
        # only needed for the JS-rendered product pages.
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # NOTE: the attribute selector in the original answer was lost in
        # transcription; this reuses the question's product-image markup.
        for link in response.css('div.product-image a::attr(href)').getall():
            yield SeleniumRequest(url=link, callback=self.parse_product)

    def parse_product(self, response):
        # scrapy-selenium exposes the live WebDriver on response.meta.
        driver = response.meta['driver']

        # The description tab holds the price lists; cut each text block
        # off at the first "features"/"specification" heading.
        blocks = driver.find_elements(
            By.XPATH,
            '//*[@id="tab-description"]/p | //*[@id="tab-description"]/div[1]/div',
        )
        price_text = ''.join(
            block.text.split('STANDARD FEATURES')[0]
                      .split('Framing')[0]
                      .split('Standard Features:')[0]
                      .split('Specification:')[0]
            for block in blocks
        )

        yield {
            'name': response.css('h2.product_title::text').get().strip(),
            'price': price_text,
            'url': response.url,
        }
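With the middleware from the settings sketch above enabled, the spider runs like any other, e.g. scrapy crawl test2 -O prices.json (-O overwrites the output file, -o appends to it).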

Output:

{'name': 'Cabin Shed', 'price': '8FT Gables:\n5 x 8 £1099\n6 x 8 £1143\n8 x 8 £1370\n10 x 8 £1597\n12 x 8 £1824\n14 x 8 £205110FT Gables\n5 x 10 £1368\n6 x 10 £1443\n8 x 10 £1772\n10 x 10 £2100\n12 x 10 £2429\n14 x 10 £2750', 'url': 'https://www.charnleys.co.uk/shop/cabin-shed/'}

... and so on
