I'm new to scrapy. I made a script to scrap data from a website and it works fine, I get the results as a JSON file and it looks perfect. Now when I try to use my script to scrap multiple URLs (same site), it works, I can get the data in JSON file for each URL, but there is a bug. My printing structure is as bellow (as coded in the script)
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:} #URL1
]
when I put 2 URLs to scrap I get this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]
It is still fine, but when I add more, the structure messes up and become like this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]
If you see closely you will notice that the title of the third URL is below the title of the second one. Can somebody help, please?
import scrapy
class QuotesSpider(scrapy.Spider):
name = "attributes"
start_urls = ["https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
"https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/"]
def parse(self, response):
yield{
"title": response.css ("div.sku-top-title::text").get(),
"desc" : response.css ("div.sku-top-desc::text").get(),
"brochure" :'brochure'
}
for post in response.css(".el-collapse"):
for i in range(len(post.css(".el-collapse-item__header"))):
res=""
lst=post.css(".value-el-desc")
x=lst[i].css(".value-el-desc p::text").extract()
for y in x:
res =y.strip() "&&"
try:
yield{
"descTitle" : post.css('.el-collapse-item__header::text')[i].get().strip(),
"desc" :res
}
except:
continue
res=""
for post in response.css(".lie-one-canshu"):
try:
dicti = {"attribute" : post.css('.lie-one-canshu::text')[0].get().strip()}
yield dicti
except:
continue
UPDATE: I notice that the bug isn't permanent, sometimes I execute the script and the result is fine.
CodePudding user response:
Scrapy's is asynchronous, so there is no guarantee to the ordering in which item's are output or processed, at least not out of the box anyway. If you want all of the output from a single URL to come out together then I suggest you only yield 1 item from each call to the parse method....
For example :
def parse(self, response):
results = {
'items': [{
"title": response.css ("div.sku-top-title::text").get(),
"desc" : response.css ("div.sku-top-desc::text").get(),
"brochure" :'brochure'
}]
}
for post in response.css(".el-collapse"):
for i in range(len(post.css(".el-collapse-item__header"))):
res=""
lst=post.css(".value-el-desc")
x=lst[i].css(".value-el-desc p::text").extract()
for y in x:
res =y.strip() "&&"
try:
results['items'].append({
"descTitle" : post.css('.el-collapse-item__header::text')[i].get().strip(),
"desc" : res
})
except:
continue
res = ""
for post in response.css(".lie-one-canshu"):
try:
results['items'].append({
"attribute" : post.css('.lie-one-canshu::text')[0].get().strip()
})
except:
continue
yield results