Nutch: best option for persistent storage in EMR for raw data


I have to crawl around 30k to 50k domains with Nutch 1.x on the AWS EMR service. The crawl will be gradual, i.e., first crawl all pages and later only new or updated pages of these websites. For indexing, I am using Apache Solr. I have a few questions about best practices with EMR:

  1. If I have to re-index or analyze old crawled data, I think the raw data should be stored on S3. Is that the right option?
  2. Related to the first question: is it better to configure S3 as the backing store for HDFS, or should I copy the crawl folder to S3 manually at the end of the EMR job?
  3. In either case, to optimize S3 storage of the raw data, how can I compress the data when moving it between the EMR cluster and S3?
  4. How can I instruct Nutch to crawl only newly found pages from the given seeds?

CodePudding user response:

  1. Nutch is able to read from and write to S3 directly, see using-s3-as-nutch-storage-system.
  2. Writing segments and the CrawlDb directly to S3 makes sense. But keeping them on HDFS and copying them to S3 (distcp) at the end is also possible; see the second sketch after this list.
  3. See mapreduce.output.fileoutputformat.compress.codec: org.apache.hadoop.io.compress.ZStandardCodec is a good option. The first sketch after this list shows how to set it.
  4. (Better to ask this again as a separate question.) Do the crawled domains all provide sitemaps? If not, the challenge is to discover as many new URLs as possible while re-fetching as few already-known pages as possible. If you want all new pages, or want to make sure that removed pages are recognized as such, it is easier to recrawl everything.
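As a starting point for points 1-3, here is a minimal sketch, assuming a recent Nutch 1.x bin/crawl script (which accepts -D property overrides and -s for the seed directory) and a Hadoop build with zstd support; the bucket name and paths are placeholders:

```sh
# Sketch: write CrawlDb and segments directly to S3 and compress
# MapReduce output with zstd. Bucket and paths are hypothetical.
# On EMR, s3:// is backed by EMRFS; outside EMR use s3a:// and
# configure the fs.s3a.* credentials in core-site.xml.
bin/crawl \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec \
  -s s3://my-nutch-bucket/seeds \
  s3://my-nutch-bucket/crawl \
  5
```

The same properties can instead be set once in mapred-site.xml or nutch-site.xml if you don't want to pass them on every invocation.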
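If you prefer to keep the crawl directory on HDFS while the job runs (point 2), the copy at the end of the EMR step could look like this sketch; bucket and paths are again placeholders:

```sh
# Sketch: copy the finished crawl directory from HDFS to S3 at the
# end of the EMR job. Bucket and paths are hypothetical.
hadoop distcp hdfs:///user/hadoop/crawl s3://my-nutch-bucket/crawl-backup

# On EMR, S3DistCp is an S3-optimized alternative:
# s3-dist-cp --src hdfs:///user/hadoop/crawl --dest s3://my-nutch-bucket/crawl-backup
```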