Accessing HDFS configured for High Availability from a client program


I am trying to understand why one program works and another does not when connecting to HDFS through the nameservice (which resolves to the active NameNode in a High Availability setup) from outside the HDFS cluster.

Non-working program:

When I load both config files (core-site.xml and hdfs-site.xml) and then access an HDFS file, it throws an error.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HadoopAccess {

  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration(false)
    val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
    val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
    println("hadoopConf : " + hadoopConf.get("fs.defaultFS"))

    val fs = FileSystem.get(hadoopConf)
    val check = fs.exists(new Path("/apps/hive"))
    //println("Checked : " + check)
  }
}

Error: we get an UnknownHostException

hadoopConf : hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
    at HadoopAccess$.main(HadoopAccess.scala:28)
    at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster
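
A quick way to see why this exception appears is to dump the HA-related keys from the loaded Configuration before calling FileSystem.get: if addResource did not actually pick up hdfs-site.xml, these keys come back null and the client treats mycluster as a plain hostname. This is only a minimal diagnostic sketch; the key names assume a nameservice called mycluster, as in the programs above.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    object HaConfCheck {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration(false)
        // Same paths as in the question; adjust for your environment.
        conf.addResource(new Path("file:///C:/Users/507/conf/core-site.xml"))
        conf.addResource(new Path("file:///C:/Users/507/conf/hdfs-site.xml"))

        // If any of these print null, the HA settings never made it into the
        // Configuration, and "mycluster" cannot be resolved to real NameNodes.
        Seq(
          "fs.defaultFS",
          "dfs.nameservices",
          "dfs.ha.namenodes.mycluster",
          "dfs.namenode.rpc-address.mycluster.nn1",
          "dfs.namenode.rpc-address.mycluster.nn2",
          "dfs.client.failover.proxy.provider.mycluster"
        ).foreach(k => println(s"$k = ${conf.get(k)}"))
      }
    }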

Working program: when I explicitly set the High Availability properties on the hadoopConf object and pass it to the FileSystem object, the program works.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HadoopAccess {

      def main(args: Array[String]): Unit = {
        val hadoopConf = new Configuration(false)
        val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
        val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
        hadoopConf.addResource(new Path("file:///" + coreSiteXML))
        hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))

        hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
        //hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
        //hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
        hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))
        hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
        hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
        hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
        hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
        println(hadoopConf)
        /* val namenode = hadoopConf.get("fs.defaultFS")
        println("namenode: " + namenode) */

        val fs = FileSystem.get(hadoopConf)
        val check = fs.exists(new Path("hdfs://mycluster/apps/hive"))
        //println("Checked : " + check)
      }
    }

Is there any reason why we need to set values for configs like dfs.nameservices, dfs.client.failover.proxy.provider.mycluster, and dfs.namenode.rpc-address.mycluster.nn1 on the hadoopConf object, when these values are already present in hdfs-site.xml and core-site.xml? These configs are the High Availability NameNode settings.

I am running the above programs either on an edge node or locally from IntelliJ.

Hadoop version: 2.7.3.2, Hortonworks: 2.6.1

My observation in the Spark Scala REPL:

When I do val hadoopConf = new Configuration(false) and val fs = FileSystem.get(hadoopConf), this gives me the local file system. But when I then run

    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))

the file system changes to DistributedFileSystem. My assumption is that some client library which is in Spark is not available during the build or in some common place on the edge node.
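
As a side note, this REPL behaviour can be explained by Configuration alone, without any Spark library: new Configuration(false) skips the default resources, so fs.defaultFS is unset and FileSystem.get falls back to the built-in file:/// (a LocalFileSystem); once the two site files are added, fs.defaultFS becomes hdfs://mycluster and a DistributedFileSystem is returned. A rough illustration to paste into the REPL, assuming the same config file paths as above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
    val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"

    val conf = new Configuration(false)
    // No resources loaded yet: fs.defaultFS is unset, so FileSystem.get
    // falls back to the built-in file:/// and returns a LocalFileSystem.
    println(FileSystem.get(conf).getClass.getName)

    conf.addResource(new Path("file:///" + coreSiteXML))
    conf.addResource(new Path("file:///" + HDFSSiteXML))
    // Now fs.defaultFS is hdfs://mycluster, so a DistributedFileSystem is
    // returned, provided the HA keys were loaded so that the name resolves.
    println(FileSystem.get(conf).getClass.getName)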

CodePudding user response:

some client library which is in Spark is not available during the build or in some common place on the edge node

This common place would be $SPARK_HOME/conf and/or $HADOOP_CONF_DIR. But if you are just running a regular Scala app with java -jar or from IntelliJ, that has nothing to do with Spark.
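
If the cluster's client configs are already deployed on the edge node, one option is to point the program at HADOOP_CONF_DIR instead of hardcoding Windows paths. A hedged sketch (the directory layout and fallback path are assumptions about your setup):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HadoopAccessFromConfDir {
      def main(args: Array[String]): Unit = {
        // Assumes HADOOP_CONF_DIR points at a directory containing the
        // cluster's core-site.xml and hdfs-site.xml (e.g. /etc/hadoop/conf).
        val confDir = sys.env.getOrElse("HADOOP_CONF_DIR", "/etc/hadoop/conf")

        val conf = new Configuration(false)
        conf.addResource(new Path(s"file://$confDir/core-site.xml"))
        conf.addResource(new Path(s"file://$confDir/hdfs-site.xml"))

        // With the HA properties loaded from hdfs-site.xml, fs.defaultFS
        // (hdfs://mycluster) resolves through the failover proxy provider.
        val fs = FileSystem.get(conf)
        println(fs.exists(new Path("/apps/hive")))
      }
    }

Alternatively, if that directory is on the application classpath, new Configuration() with defaults enabled will typically pick up core-site.xml on its own (and hdfs-site.xml once the HDFS client classes are loaded), which is roughly how Spark picks up the files under $SPARK_HOME/conf.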

... these values are already present in hdfs-site.xml and core-site.xml

Then they should be read accordingly; overriding them in the code shouldn't hurt either, though.

The values are necessary because they dictate where the actual NameNodes are running; otherwise, the client thinks mycluster is the real DNS name of a single server, which it isn't.
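
To make that concrete, here is a conceptual sketch (not the actual client code path) of roughly the lookup the HA client performs to turn the logical name mycluster into physical NameNode addresses; the key names match the working program above:

    import org.apache.hadoop.conf.Configuration

    // Roughly the lookup the HA client performs for a logical nameservice:
    // read the NameNode ids, then the rpc-address for each id.
    def resolveNameservice(conf: Configuration, ns: String): Seq[String] = {
      val nnIds = Option(conf.get(s"dfs.ha.namenodes.$ns"))
        .map(_.split(",").map(_.trim).toSeq)
        .getOrElse(Seq.empty)
      nnIds.flatMap(id => Option(conf.get(s"dfs.namenode.rpc-address.$ns.$id")))
    }

    // With the settings from the working program this yields
    // Seq("namenode1:8020", "namenode2:8020"); if the keys are missing it is
    // empty, and the client falls back to DNS, hence UnknownHostException: mycluster.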
