I am using Scrapy for broad crawling and have the following requirements:
- Scrapy will scrape the URL;
- Scrapy will parse the response from the URL and write the parsed results to a file, say file1.json, if and only if the size of file1.json is less than 2GB. Otherwise, Scrapy will create a new file, say file2.json, and write the response to that new file;
- Once the response is returned, Scrapy will extract the URLs from the response and follow the extracted URLs, then start again from point 2.
Below is my code. I am able to perform steps 1 and 3, but I couldn't understand where I should place the logic of creating the new file, checking its size, and writing the response.
# imports used by this method (defined at the top of the spider module)
from bs4 import BeautifulSoup
import urllib.parse
from urllib.parse import urlparse

def parse(self, response):
    url = response.request.url
    soup = BeautifulSoup(response.text, 'lxml')
    d = {}
    for element in soup.find_all():
        if element.name in ["html", "body", "script", "footer"]:
            pass
        else:
            x = element.find_all(text=True, recursive=False)
            if x:
                d[element.name] = x
    yield d  # ---------> I want to write this dictionary to a file as per the logic of step 2
    for link in soup.find_all('a', href=True):
        absoluteUrl = urllib.parse.urljoin(url, link['href'])
        parsedUrl = urlparse(absoluteUrl)
        if parsedUrl.scheme.strip().lower() != 'https' and parsedUrl.scheme.strip().lower() != 'http':
            pass
        else:
            url = url.replace("'", r"\'")
            absoluteUrl = absoluteUrl.replace("'", r"\'")
            self.graph.run(
                "MERGE (child:page{page_url:'" + url + "'}) "
                "ON CREATE "
                "SET child.page_url='" + url + "', child.page_rank = 1.0 "
                "MERGE (parent:page{page_url:'" + absoluteUrl + "'}) "
                "ON CREATE "
                "SET parent.page_url = '" + absoluteUrl + "', parent.page_rank = 1.0 "
                "MERGE (child)-[:FOLLOWS]->(parent)"
            )
            yield response.follow(absoluteUrl, callback=self.parse)  # ---> Step 3 (all good)
My question is: where should I write the logic of creating the file, checking the file size, and writing the spider response into that file? Should it be in a pipeline, in a middleware, or in the __init__ function of the spider?
Any help would be appreciated. I tried learning about middlewares, pipelines, etc., but couldn't figure out how to implement this functionality.
CodePudding user response:
If you know the approximate number of items that every file should hold without exceeding the 2GB size limit, then out of the box you can use the FEED_EXPORT_BATCH_ITEM_COUNT setting, and Scrapy will automatically create a new file whenever the number of items in the current file reaches that limit. Read more about this setting on the FEEDS page.
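For example, a minimal sketch of what this could look like in settings.py (the items-per-file figure is an assumed estimate that has to be tuned so each batch stays under 2GB; the %(batch_id)d placeholder in the feed URI is required when batching is enabled):

# settings.py: a sketch, with an assumed batch size
FEEDS = {
    "file%(batch_id)d.json": {   # produces file1.json, file2.json, ... one per batch
        "format": "json",
        "encoding": "utf8",
    },
}
FEED_EXPORT_BATCH_ITEM_COUNT = 100000  # assumed estimate; tune so each file stays under 2GB

If rotation has to be driven by the actual file size rather than an item count, the check would more naturally live in a custom item pipeline, since pipelines receive every item the spider yields. A rough, untested sketch (the class name, file naming scheme, and JSON-lines output format are illustrative choices, not part of the question's code):

import json
import os

class SizeRotatingJsonPipeline:
    MAX_BYTES = 2 * 1024 ** 3  # the 2GB limit from the question

    def open_spider(self, spider):
        self.index = 1
        self.file = open(f"file{self.index}.json", "a", encoding="utf8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Switch to a new file once the current one reaches the size limit.
        self.file.flush()
        if os.path.getsize(self.file.name) >= self.MAX_BYTES:
            self.file.close()
            self.index += 1
            self.file = open(f"file{self.index}.json", "a", encoding="utf8")
        self.file.write(json.dumps(item, default=str) + "\n")  # one JSON object per line
        return item

The pipeline would then be enabled through ITEM_PIPELINES in settings.py, e.g. {"myproject.pipelines.SizeRotatingJsonPipeline": 300}, where the module path is hypothetical.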