This is my code:
import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'
    start_urls = ['http://www.cleanman-cn.com/productlist.php/']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link = cat.css('a::attr(href)').get()
                categories = {
                    'Categorie': name,
                    'Url': base_url + link
                }
                yield categories

        csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
        urls = csv['Url']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

        master = response.css('.web_prolist')
        for item in master:
            li = item.css('li')
            for x in li:
                link = x.css('a::attr(href)').get()
                yield link
When I use scrapy shell to get my elements, they turn out OK, as shown:
In [13]: master = response.css('.web_prolist')

In [18]: for item in master:
    ...:     li = item.css('li')
    ...:     for x in li:
    ...:         link = x.css('a::attr(href)').get()
    ...:         print(link)
    ...:
product_show.php?id=789
product_show.php?id=790
product_show.php?id=707
product_show.php?id=708
product_show.php?id=709
product_show.php?id=710
product_show.php?id=711
product_show.php?id=712
product_show.php?id=713
When I run my spider, I get this result:
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matching Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=1'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Two Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=2'}
2021-11-03 17:28:17 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.cleanman-cn.com/product.php?b_id=1>
- no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'One Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=3'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=4'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=5'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Color Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=6'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matt Finish Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=7'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Intelligent Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=8'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=9'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Pedestal basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=10'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Accessory', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=11'}
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=10> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=3> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=5> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=11> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=2> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=4> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=6> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=1> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=9> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=7> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=8> (referer: http://www.cleanman-cn.com/productlist.php/)
I'm using yield to get all the links on the first page for each category, then using those links in the yielded scrapy.Request to get a response for every URL and collect the product links, from which I will scrape the detailed info.
But I can't get it to work, although everything seems right to me and the shell gives the correct output.
What am I doing wrong?
I'm a self-taught Python "developer", just for the fun of it, and I believe I'm doing something wrong but I just can't pin it down. Please be kind with any criticism of my code or the way I code; that's my learning process.
Thanks in advance.
CodePudding user response:
First of all, you need to remove the / from the end of ['http://www.cleanman-cn.com/productlist.php/']
(test it with and without the slash to see the difference).
Second, you try to yield a string (link), which makes Scrapy raise: ERROR: Spider must return request, item, or None, got 'str'.
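A minimal sketch of why that fails and how the dict fixes it (the hrefs below are hard-coded from the shell output above, just for illustration): a Scrapy callback may only yield a Request, an item (e.g. a dict), or None, so wrapping each href string in a dict turns it into a valid item.

```python
# Yielding a bare string from a callback triggers the "got 'str'" error;
# wrapping each href in a dict makes it an item Scrapy can process.
links = ['product_show.php?id=789', 'product_show.php?id=790']  # e.g. from .getall()
items = [{'link': link} for link in links]
print(items[0])  # {'link': 'product_show.php?id=789'}
```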
Also, you might want to scrape the links in a separate callback:
import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'
    # here I removed the slash at the end
    start_urls = ['http://www.cleanman-cn.com/productlist.php']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link = cat.css('a::attr(href)').get()
                categories = {
                    'Categorie': name,
                    'Url': base_url + link
                }
                yield categories

        csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
        urls = csv['Url']
        for url in urls:
            # since I don't have your 'cleanmancategories' I tested it with url=base_url + link
            yield scrapy.Request(url=url, callback=self.parse_items)

    def parse_items(self, response):
        master = response.css('.web_prolist')
        for item in master:
            li = item.css('li')
            for x in li:
                link = x.css('a::attr(href)').get()
                yield {'link': link}
Output:
{'link': 'product_show.php?id=773'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=774'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=775'}
...
...
...
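One follow-up worth noting: the yielded links are relative (product_show.php?id=...). If the goal is to visit each product page for the detailed info, yielding response.follow(link, callback=...) inside parse_items resolves them against the current page automatically. As a sketch of that resolution (URLs taken from the output above), the stdlib urljoin performs the same computation as Scrapy's response.urljoin:

```python
from urllib.parse import urljoin

# The scraped hrefs are relative to the category page they came from.
# Scrapy's response.urljoin(link) (and response.follow) resolve them the
# same way urllib.parse.urljoin does:
page_url = 'http://www.cleanman-cn.com/product.php?b_id=1'
link = 'product_show.php?id=773'

absolute = urljoin(page_url, link)
print(absolute)  # http://www.cleanman-cn.com/product_show.php?id=773
```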