I am trying to understand the context of the working and not working program which connects HDFS via nameservice(which connects active name node - High availability Namenode) outside HDFS cluster.
Not working program:
When i read both config files (core-site.xml and hdfs-site.xml) and accessing HDFS file , it throws an error
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object HadoopAccess {
def main(args: Array[String]): Unit ={
val hadoopConf = new Configuration(false)
val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
hadoopConf.addResource(new Path("file:///" coreSiteXML))
hadoopConf.addResource(new Path("file:///" HDFSSiteXML))
println("hadoopConf : " hadoopConf.get("fs.defaultFS"))
val fs = FileSystem.get(hadoopConf)
val check = fs.exists(new Path("/apps/hive"));
//println("Checked : " check)
}
}
Error : We see that Unknownhost Exception
hadoopConf :
hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at HadoopAccess$.main(HadoopAccess.scala:28)
at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster
Working Program : I specifically set the High availability into hadoopConf object and passing to Filesystem object , the program works
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object HadoopAccess {
def main(args: Array[String]): Unit ={
val hadoopConf = new Configuration(false)
val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
hadoopConf.addResource(new Path("file:///" coreSiteXML))
hadoopConf.addResource(new Path("file:///" HDFSSiteXML))
hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
//hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
//hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))
hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
println(hadoopConf)
/* val namenode = hadoopConf.get("fs.defaultFS")
println("namenode: " namenode) */
val fs = FileSystem.get(hadoopConf)
val check = fs.exists(new Path("hdfs://mycluster/apps/hive"));
//println("Checked : " check)
}
}
Any reason why we need to set values for this configs like dfs.nameservices,fs.client.failover.proxy.provider.mycluster,dfs.namenode.rpc-address.mycluster.nn1
in hadoopconf object as this values already present in hdfs-site.xml file and core-site.xml. These configs are High availability Namenode settings.
The above program which I am running via Edge mode or local IntelliJ.
Hadoop version : 2.7.3.2 Hortonworks : 2.6.1
My observation in Spark Scala REPL :
When I do val hadoopConf = new Configuration(false)
and val fs = FileSystem.get(hadoopConf)
.This gives me Local FileSystem .So when I perform below
hadoopConf.addResource(new Path("file:///" coreSiteXML))
hadoopConf.addResource(new Path("file:///" HDFSSiteXML))
,now the file System changed to DFSFileSysyem ..My assumption is that some client library which is in Spark that is not available in somewhere in during build or edge node common place .
CodePudding user response:
some client library which is in Spark that is not available in somewhere in during build or edge node common place
This common place would be $SPARK_HOME/conf
and/or $HADOOP_CONF_DIR
. But if you are just running a regular Scala app with java jar
or with IntelliJ, that has nothing to do with Spark.
... this values already present in hdfs-site.xml file and core-site.xml
Then, they should be read, accordingly, however overriding in the code shouldn't hurt either.
The values are necessary because they dicate where the actual namenodes are running; otherwise, it thinks mycluster
is a real DNS name of only one server, when it isn't