In Scala, I am trying to count the files in an HDFS directory.
I tried to get a list of the files with val files = fs.listFiles(path, false)
and then to count them or take the size, but that doesn't work, since files
has type RemoteIterator[LocatedFileStatus], which offers neither.
Any idea on how I should proceed?
Thanks for helping
CodePudding user response:
This has been done before, but generally people use the FSImage (a copy of the NameNode metadata file).
They then load it into a Hive table, which you can query for information about your HDFS file system.
Here's a really good tutorial that explains how to export the fsimage and load it into a Hive table.
Here's another approach, which I think I prefer:
Fetch and copy the fsimage file into HDFS:

# connect to any hadoop cluster node as the hdfs user
# download the fsimage file from the namenode
hdfs dfsadmin -fetchImage /tmp

# convert the fsimage file into a tab-delimited file
hdfs oiv -i /tmp/fsimage_0000000000450297390 -o /tmp/fsimage.csv -p Delimited

# remove the header and copy to HDFS
sed -i -e "1d" /tmp/fsimage.csv
hdfs dfs -mkdir /tmp/fsimage
hdfs dfs -copyFromLocal /tmp/fsimage.csv /tmp/fsimage

# create the intermediate external table in Impala
CREATE EXTERNAL TABLE HDFS_META_D (
  PATH STRING,
  REPL INT,
  MODIFICATION_TIME STRING,
  ACCESSTIME STRING,
  PREFERREDBLOCKSIZE INT,
  BLOCKCOUNT DOUBLE,
  FILESIZE DOUBLE,
  NSQUOTA INT,
  DSQUOTA INT,
  PERMISSION STRING,
  USERNAME STRING,
  GROUPNAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/tmp/fsimage';
Once it's in a table, you can do the rest in Scala/Spark.
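As a sketch of that last step (assuming the HDFS_META_D table above and a hypothetical directory prefix), a Spark query to count the files under a directory could look like this; the filter on PERMISSION assumes directory rows start with 'd' in the delimited fsimage output:

import org.apache.spark.sql.SparkSession

// Minimal sketch: count files under a directory by querying the
// HDFS_META_D table created above. The directory prefix is hypothetical,
// and the PERMISSION filter assumes directory rows start with 'd'.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val fileCount = spark.sql(
  """SELECT COUNT(*) AS cnt
    |FROM HDFS_META_D
    |WHERE PATH LIKE '/user/some/dir/%'
    |  AND PERMISSION NOT LIKE 'd%'
    |""".stripMargin
).first().getLong(0)

println(s"Files under directory: $fileCount")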
CodePudding user response:
I ended up using:
var count: Int = 0
while (files.hasNext) {
  files.next()
  count += 1
}
As a Scala beginner, I didn't know how to increment a counter
(the answer is count += 1
). This actually works quite well.
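For reference, a more idiomatic alternative is to wrap the Hadoop RemoteIterator in a Scala Iterator so the standard collection methods become available. A minimal sketch, assuming an already-configured FileSystem and a hypothetical directory path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}

// Sketch: adapt Hadoop's RemoteIterator to a Scala Iterator so that
// size, count, filter, etc. can be used directly.
def asScalaIterator[T](it: RemoteIterator[T]): Iterator[T] =
  new Iterator[T] {
    override def hasNext: Boolean = it.hasNext
    override def next(): T = it.next()
  }

val fs = FileSystem.get(new Configuration())
val path = new Path("/some/hdfs/dir") // hypothetical directory
val count = asScalaIterator(fs.listFiles(path, false)).size
println(s"$count files")

Draining the iterator with size visits every entry once, which is effectively the same loop as above but without the mutable counter.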