Scrapy script returns elements in shell but not when I run the spider

This is my code:

import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'
    
    start_urls = ['http://www.cleanman-cn.com/productlist.php/']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link = cat.css('a::attr(href)').get()

                categories = {
                    'Categorie': name,
                    'Url': base_url + link
                }
                yield categories

                csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
                urls = csv['Url']

                for url in urls:
                    yield scrapy.Request(url, callback=self.parse)
                    master = response.css('.web_prolist')
                    for item in master:
                        li = item.css('li')
                        for x in li:
                            link = x.css('a::attr(href)').get()
                            yield link

When I use scrapy shell to get my elements, they come out fine, as shown:

In [13]: master = response.css('.web_prolist')

In [18]: for item in master:
    ...:     li = item.css('li')
    ...:     for x in li:
    ...:         link = x.css('a::attr(href)').get()
    ...:         print(link)
    ...: 
product_show.php?id=789
product_show.php?id=790
product_show.php?id=707
product_show.php?id=708
product_show.php?id=709
product_show.php?id=710
product_show.php?id=711
product_show.php?id=712
product_show.php?id=713

When I run my spider, I get this result:

2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matching Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=1'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Two Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=2'}
2021-11-03 17:28:17 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.cleanman-cn.com/product.php?b_id=1> 
- no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'One Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=3'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=4'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=5'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Color Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=6'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matt Finish Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=7'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Intelligent Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=8'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=9'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Pedestal basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=10'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Accessory', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=11'}
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=10> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=3> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=5> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=11> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=2> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=4> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=6> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=1> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=9> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=7> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=8> (referer: http://www.cleanman-cn.com/productlist.php/)

I'm using yield to get all the category links on the first page, then using those links in the scrapy.Request to get a response for every URL, from which I want to collect the product links that lead to the detailed info.

But I can't get it to work, although everything seems right to me and the shell gives the correct output.

What am I doing wrong?

I'm a self-taught Python "developer", just for the fun of it, and I believe I'm doing something wrong that I just can't pin down. Please be kind with the criticism of my code or of the way I code; that's my learning process.

Thanks in advance

CodePudding user response:

First of all, you need to remove the trailing / from ['http://www.cleanman-cn.com/productlist.php/'] (test it with and without the slash to see the difference).

You are trying to yield a plain string, which produces: ERROR: Spider must return request, item, or None, got 'str' (link).
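To illustrate with a minimal stdlib sketch (the values are made up, standing in for what the spider extracts): a Scrapy callback may only yield a Request, an item-like object such as a dict, or None, so a relative href either needs to be wrapped in a dict or resolved into an absolute URL for a new Request:

```python
from urllib.parse import urljoin

# Hypothetical example values:
page_url = 'http://www.cleanman-cn.com/product.php?b_id=1'
link = 'product_show.php?id=789'

# Option 1: wrap the string in a dict so Scrapy treats it as an item.
item = {'link': link}

# Option 2: resolve it against the current page (what response.urljoin
# does for you) and pass the absolute URL to a new scrapy.Request:
absolute = urljoin(page_url, link)
# yield scrapy.Request(absolute, callback=self.parse_items)
```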

Also, you might want to scrape the links in a separate callback:

import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'

    # here I removed the slash at the end
    start_urls = ['http://www.cleanman-cn.com/productlist.php']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link  = cat.css('a::attr(href)').get()

                categories = {
                    'Categorie' : name,
                    'Url' : base_url + link
                }
                yield categories

                csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
                urls = csv['Url']

                for url in urls:
                    # since I don't have your 'cleanmancategories' I tested it with url=base_url + link
                    yield scrapy.Request(url=url, callback=self.parse_items)


    def parse_items(self, response):
        master = response.css('.web_prolist')
        for item in master:
            li = item.css('li')
            for x in li:
                link = x.css('a::attr(href)').get()
                yield {'link': link}

Output:

{'link': 'product_show.php?id=773'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=774'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=775'}
...
...
...
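Note that the yielded links are still relative. If you later want to request the product_show.php pages themselves, each href has to be resolved against the page it was scraped from; Scrapy's response.follow and response.urljoin do this for you. A stdlib sketch of that joining step, with example values taken from the output above:

```python
from urllib.parse import urljoin

# Example values; in the spider these come from response.url and the css() calls.
page_url = 'http://www.cleanman-cn.com/product.php?b_id=1'
relative_links = ['product_show.php?id=773', 'product_show.php?id=774']

# Resolve each relative href against its page, as response.urljoin would.
absolute_links = [urljoin(page_url, rel) for rel in relative_links]
```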