All of the files under the input path are binary, so I have to load them with sequenceFile(). The problem is that if the path contains empty files, the job fails with an error saying the file type is wrong, which caught me by surprise.
If they were plain files loaded with textFile(), the job would run fine even with empty files under the path.
Is there any way to filter out the empty files?
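For reference, a minimal sketch of the situation described above; the original post shows no code, so the input directory and the key/value classes here are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

object LoadBinaryDir {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-binary-dir"))

    // Load a directory of SequenceFiles (key/value classes assumed here).
    // A zero-byte file in the directory triggers the "wrong file type" error
    // described above, whereas textFile() would simply yield no records for it.
    val rdd = sc.sequenceFile("/data/input", classOf[Text], classOf[BytesWritable])
    println(rdd.count())

    sc.stop()
  }
}
```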
CodePudding user response:
No one?
CodePudding user response:
Still no one? In the end I settled on something that isn't really a proper solution: write the paths of the non-empty files into a catalog file.
Have Spark read that catalog file into an RDD, then use the RDD's contents (the file paths) as the input path for another RDD.
That gets the job done.
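A minimal sketch of that workaround, assuming a catalog file with one non-empty data-file path per line (the file names and key/value classes are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

object LoadFromCatalog {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-from-catalog"))

    // Catalog file listing only the non-empty data files, one path per line (illustrative name).
    val catalog = "/data/catalog/non_empty_paths.txt"

    // Read the catalog into an RDD, bring the paths back to the driver,
    // and pass them to sequenceFile() as a comma-separated path string.
    val paths = sc.textFile(catalog).collect().mkString(",")
    val data = sc.sequenceFile(paths, classOf[Text], classOf[BytesWritable])

    println(data.count())
    sc.stop()
  }
}
```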
PS:
Because there are so many file paths (more than 100,000, so the catalog file itself is large), the job still hit errors. Exhausting!
I had no choice but to split the catalog file again into ten files of ten thousand paths each.
That works, but it feels far too inconvenient: new data comes in every day, so a fresh set of catalog files has to be generated before every run.
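One way to avoid maintaining ten catalog files by hand is to split the path list in code and union the resulting RDDs; a sketch under the same assumptions as above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

object LoadInChunks {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-in-chunks"))

    // Collect the full list of non-empty file paths to the driver (illustrative catalog file).
    val allPaths = sc.textFile("/data/catalog/non_empty_paths.txt").collect()

    // Load ten thousand paths at a time, then union the pieces into one RDD.
    val rdds = allPaths.grouped(10000).toSeq
      .map(chunk => sc.sequenceFile(chunk.mkString(","), classOf[Text], classOf[BytesWritable]))
    val all = sc.union(rdds)

    println(all.count())
    sc.stop()
  }
}
```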
CodePudding user response:
Because this approach has to read binary files, I couldn't do it with textFile().
CodePudding user response:
Before each run, check the file size information under the specified directory, filter out the empty files, and return a list of the remaining paths; then have Spark load those paths. It's just a matter of defining one extra method.
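A sketch of that approach using the Hadoop FileSystem API to skip zero-length files before loading (the directory and key/value classes are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, Text}

object FilterEmptyFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-empty-files"))

    // List the input directory and keep only files whose length is greater than zero.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val nonEmpty = fs.listStatus(new Path("/data/input"))
      .filter(status => status.isFile && status.getLen > 0)
      .map(_.getPath.toString)

    // Hand sequenceFile() the surviving paths as a comma-separated string.
    val data = sc.sequenceFile(nonEmpty.mkString(","), classOf[Text], classOf[BytesWritable])

    println(data.count())
    sc.stop()
  }
}
```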
CodePudding user response:
Qiongwei