Apache Nutch not reading a new configuration file when run with job file

Time:06-14

I have configured Apache Nutch 1.x for web crawling. I need to add some extra information to the Solr document for each indexed domain. The extra information comes from a JSON configuration file. I developed the following code for this by modifying the index-basic plugin, and tested it successfully in local mode. The code snippet is as follows:

this.enable_extra_domain = conf.getBoolean("domain.extraInfo.enable", false);
if (this.enable_extra_domain) {
    String domainExtraInfo = conf.get("domain.extraInfo.file", "conf/domain-extra.json");
    readDomainFile(domainExtraInfo);
    LOG.info("domain.extraInfo.enable is enabled. Using " + domainExtraInfo + " for input.");
}
else {
    LOG.info("domain.extraInfo.enable is disabled.");
}

The function that reads the file is shown below:

private void readDomainFile(String domainExtraInfo) {
    // Instance of our Domain map with extra info
    website_records = new HashMap<String, List<Object>>();
    
    JSONParser jsonParser = new JSONParser();
    try (FileReader reader = new FileReader(domainExtraInfo))
    {
        Object obj = jsonParser.parse(reader);
        JSONArray DomainList = (JSONArray) obj;
  
        DomainList.forEach( domain -> parseDomainObject( (JSONObject) domain ) );
        
    }
    catch (Exception e) {
        // TODO: handle exception
        e.printStackTrace();
    }
}

This code works when I run it in local mode. But when I run Nutch with the .job file on EMR (or another Hadoop cluster), I get a java.io.FileNotFoundException. What is the problem? In local mode my new configuration file is in the conf folder, while in deploy mode it is added to the .job file.

CodePudding user response:

I have my new configuration file in conf folder in local mode while in deploy, it is added in .job file

In distributed mode, the file needs to be read from the job file deployed to the Hadoop cluster nodes. The easiest way is to use the methods provided by the Hadoop Configuration class, for example getConfResourceAsReader(String name). Note: the argument "name" is the file name without the directory part ("domain-extra.json"). You'll find many examples in the Nutch source code, e.g. in one of the URL filters.
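
To illustrate the idea without pulling in Hadoop, here is a minimal plain-Java sketch of a classpath-based lookup. As far as I know, this is roughly what getConfResourceAsReader does internally, but treat that as an assumption. The class ClasspathResourceDemo and the helpers openResource and readAll are hypothetical names made up for this example; they are not part of Nutch or Hadoop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch or Hadoop.
class ClasspathResourceDemo {

    // Looks up a resource by its bare name through the class loader.
    // This works whether the file sits in a directory on the classpath
    // or is packed inside a jar/job file. Returns null if it is not found.
    static Reader openResource(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClasspathResourceDemo.class.getClassLoader();
        }
        InputStream in = cl.getResourceAsStream(name);
        if (in == null) {
            return null;
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Reads the whole content of a Reader into a String.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Bare file name, without the "conf/" directory part.
        String name = args.length > 0 ? args[0] : "domain-extra.json";
        Reader reader = openResource(name);
        if (reader == null) {
            System.err.println(name + " not found on the classpath");
            return;
        }
        try (Reader r = reader) {
            System.out.println(readAll(r));
        }
    }
}
```

In the plugin itself, the equivalent change would presumably be to set domain.extraInfo.file to the bare name "domain-extra.json" and to pass the Reader returned by conf.getConfResourceAsReader(domainExtraInfo) to jsonParser.parse(...) instead of constructing a FileReader. It is worth checking the returned Reader for null before parsing, since a missing resource would otherwise surface as a NullPointerException.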
