Write processed results in JSON files


I am using Scrapy for a broad crawl and have the following requirements:

  1. Scrapy will scrape the URL;
  2. Scrapy will parse the response and write the parsed result to a file, say file1.json, if and only if the size of file1.json is less than 2GB; otherwise, Scrapy will create a new file, say file2.json, and write the result there;
  3. Once the response is returned, Scrapy will extract the URLs from it, follow them, and repeat from point 2.

Below is my code. I am able to perform steps 1 and 3, but I can't figure out where to place the logic for creating a new file, checking its size, and writing the parsed result to it.

def parse(self, response):

    url = response.request.url
    soup = BeautifulSoup(response.text, 'lxml')

    d = {}
    for element in soup.find_all():
        if element.name in ["html", "body", "script", "footer"]:
            pass

        else:
            x = element.find_all(text=True, recursive=False)
            if x:
                d[element.name] = x

    yield d  # I want to write this dictionary to a file as per the logic of step 2

    for link in soup.find_all('a', href=True):
        absoluteUrl = urllib.parse.urljoin(url, link['href'])
        parsedUrl = urlparse(absoluteUrl)
        if parsedUrl.scheme.strip().lower() != 'https' and parsedUrl.scheme.strip().lower() != 'http':
            pass
        else:

            url = url.replace("'", r"\'")
            absoluteUrl = absoluteUrl.replace("'", r"\'")

            self.graph.run(
                "MERGE (child:page{page_url:'" + url + "'}) "
                "ON CREATE "
                "SET child.page_url='" + url + "', child.page_rank = 1.0 "
                "MERGE (parent:page{page_url:'" + absoluteUrl + "'}) "
                "ON CREATE "
                "SET parent.page_url = '" + absoluteUrl + "', parent.page_rank = 1.0 "
                "MERGE (child)-[:FOLLOWS]->(parent)"
            )

            yield response.follow(absoluteUrl, callback=self.parse)  # Step 3 (all good)

My question: should the logic for creating the file, checking the file size, and writing the spider output into that file go in a pipeline, a middleware, or the spider's __init__ method?

Any help would be appreciated. I tried reading up on middlewares and pipelines, but couldn't figure out how to implement this functionality.

CodePudding user response:

If you know the approximate number of items each file can hold without exceeding the 2GB limit, then out of the box you can use the FEED_EXPORT_BATCH_ITEM_COUNT setting: Scrapy will automatically start a new file whenever the number of items in the current file reaches that limit. Read more about this setting on the FEEDS settings page.
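For example, in settings.py (the file name pattern and the item count below are placeholder values you would tune so that each batch stays under roughly 2GB for your items; note that batch delivery requires Scrapy 2.3 or later, and the feed URI must contain %(batch_id)d or %(batch_time)s when batching is enabled):

# settings.py
# Start a new output file every 10,000 items (placeholder value).
FEED_EXPORT_BATCH_ITEM_COUNT = 10_000

FEEDS = {
    # %(batch_id)d is substituted with 1, 2, 3, ... for each new file
    "file%(batch_id)d.json": {
        "format": "json",
        "encoding": "utf8",
    },
}

With this in place, the dictionaries you yield from parse() are written out by the feed exporter itself, so no extra file-handling code is needed in the spider.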

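If you need to rotate strictly on file size rather than item count, an item pipeline is the natural place for that logic, since it receives every item the spider yields. Below is a minimal, untested sketch (the class and module names are made up for illustration); it writes one JSON object per line rather than a single JSON array, so adapt the serialization if you need the exact feed-export format:

# pipelines.py (hypothetical module in your project)
import json
import os


class SizeRotatingJsonPipeline:
    """Write each item as a JSON line and switch to a new file once the
    current one grows past max_bytes (2GB here, per the question)."""

    max_bytes = 2 * 1024 ** 3

    def open_spider(self, spider):
        self.index = 1
        self.file = open(f"file{self.index}.json", "a", encoding="utf8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        self.file.flush()
        # Rotate when the current file has reached the size limit
        if os.path.getsize(self.file.name) >= self.max_bytes:
            self.file.close()
            self.index += 1
            self.file = open(f"file{self.index}.json", "a", encoding="utf8")
        return item

Enable it in settings.py with something like ITEM_PIPELINES = {"myproject.pipelines.SizeRotatingJsonPipeline": 300} (the module path is a placeholder), and keep yielding the dictionary from parse() exactly as you already do.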