Order of JSON data is messed up when scraping multiple URLs with Scrapy

Time:08-21

I'm new to Scrapy. I wrote a script to scrape data from a website and it works fine: I get the results as a JSON file and it looks perfect. Now when I use the script to scrape multiple URLs (same site), it works and I get the data in the JSON file for each URL, but there is a bug. The output structure is as below (as coded in the script):

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:},  #URL1
{attribute:} #URL1
]

When I give it 2 URLs to scrape, I get this:

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
] 

That is still fine, but when I add more, the structure gets messed up and becomes like this:

[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:}
]

If you look closely, you will notice that the title of the third URL comes directly after the title of the second one, before the rest of the second URL's data. Can somebody help, please?

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = ["https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
    "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/"]

    def parse(self, response):
        # First item: page title and description.
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }
        # One item per collapsible section.
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                        "desc": res,
                    }
                except:
                    continue

        # One item per attribute.
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css(".lie-one-canshu::text")[0].get().strip()}
                yield dicti
            except:
                continue

UPDATE: I notice that the bug isn't permanent, sometimes I execute the script and the result is fine.

Answer:

Scrapy is asynchronous, so there is no guarantee about the order in which items are output or processed, at least not out of the box. If you want all of the output from a single URL to come out together, I suggest you yield only one item from each call to the parse method.
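To illustrate why items from different URLs can interleave, here is a toy simulation. It uses the standard library's asyncio rather than Scrapy itself (Scrapy's real scheduling is built on Twisted and depends on network timing), and the names fake_parse, crawl, and PARTS are made up for this sketch.

```python
import asyncio

# The three kinds of item the spider in the question yields per page.
PARTS = ["title", "titleDesc", "attribute"]


async def fake_parse(url, sink):
    """Stand-in for a parse callback: emits three items for one URL."""
    for part in PARTS:
        sink.append((url, part))
        # Hand control back to the event loop, as happens whenever
        # concurrent work is pending in an asynchronous framework.
        await asyncio.sleep(0)


async def crawl(urls):
    """Run one fake_parse per URL concurrently and collect all items."""
    sink = []
    await asyncio.gather(*(fake_parse(url, sink) for url in urls))
    return sink


output = asyncio.run(crawl(["URL1", "URL2", "URL3"]))
for url, part in output:
    print(url, part)
```

In this sketch the items of the three URLs end up mixed in the shared list, although each URL's own items stay in their original relative order. In a real crawl the exact interleaving varies from run to run, which would fit the observation that the output is sometimes fine.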

For example, the parse method can collect everything into a single item and yield it once:

def parse(self, response):
    # Collect everything scraped from this response into one item.
    results = {
        "items": [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results["items"].append({
                    "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                    "desc": res,
                })
            except (IndexError, AttributeError):
                # Header node or its text is missing; skip this section.
                continue

    for post in response.css(".lie-one-canshu"):
        try:
            results["items"].append({
                "attribute": post.css(".lie-one-canshu::text")[0].get().strip()
            })
        except (IndexError, AttributeError):
            continue
    # A single yield per response keeps each URL's data together.
    yield results
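If you would rather keep yielding several small items per page, a different approach (not the one shown above) is to tag every item with its source URL and regroup the exported list afterwards. The sketch below assumes you add a "url" key to each yielded dict, for instance from response.url; the helper group_by_url and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_url(items):
    """Group a flat list of scraped items by their "url" field.

    The relative order of items within each URL is preserved.
    """
    grouped = defaultdict(list)
    for item in items:
        # Copy every field except the (assumed) "url" tag.
        fields = {key: value for key, value in item.items() if key != "url"}
        grouped[item["url"]].append(fields)
    return dict(grouped)


# Interleaved output similar to the question's, with the assumed "url" tag.
interleaved = [
    {"url": "URL2", "title": "t2"},
    {"url": "URL3", "title": "t3"},
    {"url": "URL2", "descTitle": "d2"},
    {"url": "URL2", "attribute": "a2"},
    {"url": "URL3", "descTitle": "d3"},
    {"url": "URL3", "attribute": "a3"},
]

grouped = group_by_url(interleaved)
for url, fields in grouped.items():
    print(url, fields)
```

This relies on the items of one response keeping their relative order in the output, which matches the interleaved listing in the question. Yielding one item per response, as above, avoids the post-processing step entirely.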