Count files in HDFS directory with Scala


In Scala, I am trying to count the files in an HDFS directory. I tried to get a list of the files with val files = fs.listFiles(path, false) and count it or take its size, but that doesn't work because files is of type RemoteIterator[LocatedFileStatus], which has neither a count nor a size method.

Any idea how I should proceed?

Thanks for helping.

CodePudding user response:

This has been done before, but generally people use the fsimage (a copy of the NameNode's metadata file).

They load it into a Hive table, and then you can query it for information about your HDFS file system.

Here's a really good tutorial that explains how to export the fsimage and load it into a Hive table.

Here's another approach that I think I prefer:

Fetch and copy the fsimage file into HDFS:

# connect to any hadoop cluster node as the hdfs user
# download the fsimage file from the namenode
hdfs dfsadmin -fetchImage /tmp

# convert the fsimage file into a tab-delimited file
hdfs oiv -i /tmp/fsimage_0000000000450297390 -o /tmp/fsimage.csv -p Delimited

# remove the header and copy to HDFS
sed -i -e "1d" /tmp/fsimage.csv
hdfs dfs -mkdir /tmp/fsimage
hdfs dfs -copyFromLocal /tmp/fsimage.csv /tmp/fsimage

# create the intermediate external table in Impala
CREATE EXTERNAL TABLE HDFS_META_D ( 
 PATH STRING , 
 REPL INT , 
 MODIFICATION_TIME STRING , 
 ACCESSTIME STRING , 
 PREFERREDBLOCKSIZE INT , 
 BLOCKCOUNT DOUBLE, 
 FILESIZE DOUBLE , 
 NSQUOTA INT , 
 DSQUOTA INT , 
 PERMISSION STRING , 
 USERNAME STRING , 
 GROUPNAME STRING) 
row format delimited
fields terminated by '\t'
LOCATION '/tmp/fsimage';

Once it's in a table, you can do the rest in Scala/Spark.
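
For example, here is a minimal Spark sketch of that last step. It assumes the HDFS_META_D table above and a SparkSession with Hive support; the '/data/mydir' prefix is a placeholder, and filtering out rows whose PERMISSION string starts with 'd' (directories) is an assumption about the delimited fsimage format:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fsimage-file-count")
  .enableHiveSupport()   // needed so the external table is visible
  .getOrCreate()

// Count the files under a directory using the fsimage table.
// '/data/mydir' is a placeholder path; entries whose permission
// string starts with 'd' are directories, so they are excluded.
val fileCount = spark.sql(
  """SELECT COUNT(*) FROM HDFS_META_D
    |WHERE PATH LIKE '/data/mydir/%'
    |  AND PERMISSION NOT LIKE 'd%'""".stripMargin
).first().getLong(0)

println(s"Files under /data/mydir: $fileCount")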

CodePudding user response:

I ended up using:

var count: Int = 0
while (files.hasNext) {
  files.next()   // advance the iterator; we only need to count
  count += 1
}

As a Scala beginner, I didn't know how to write the increment (the answer is count += 1). This actually works quite well.
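
For reference, a couple of shorter alternatives (a sketch, assuming fs is an org.apache.hadoop.fs.FileSystem and path a Path as in the question; /data/mydir is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val path = new Path("/data/mydir")

// Non-recursive: count only the immediate children that are files.
val direct = fs.listStatus(path).count(_.isFile)

// Recursive: ask the NameNode for the file count of the whole subtree.
val recursive = fs.getContentSummary(path).getFileCount

Note that listFiles(path, false) is non-recursive, so the first variant matches the loop above, while getContentSummary counts every file in the subtree.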
