All of the files under the input path are binary, so I have to load them with sequenceFile(). The problem is that if the path contains empty files, the job fails with an error saying the file type is wrong, which caught me by surprise.
If they were plain files loaded with textFile(), the job would run fine even with empty files under the path.
Is there any way to filter out the empty files?
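For reference, a minimal sketch of the situation described above; the original post shows no code, so the input directory and the key/value classes here are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

object LoadBinaryDir {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-binary-dir"))

    // Load a directory of SequenceFiles (key/value classes assumed here).
    // A zero-byte file in the directory triggers the "wrong file type" error
    // described above, whereas textFile() would simply yield no records for it.
    val rdd = sc.sequenceFile("/data/input", classOf[Text], classOf[BytesWritable])
    println(rdd.count())

    sc.stop()
  }
}
```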
CodePudding user response:
No one?
CodePudding user response:
Still no one? In the end I settled on something that isn't really a proper solution: write the paths of the non-empty files into a catalog file.
Have Spark read that catalog file into an RDD, then use the RDD's contents (the file paths) as the input path for another RDD.
That gets the job done.
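A minimal sketch of that workaround, assuming a catalog file with one non-empty data-file path per line (the file names and key/value classes are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

object LoadFromCatalog {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-from-catalog"))

    // Catalog file listing only the non-empty data files, one path per line (illustrative name).
    val catalog = "/data/catalog/non_empty_paths.txt"

    // Read the catalog into an RDD, bring the paths back to the driver,
    // and pass them to sequenceFile() as a comma-separated path string.
    val paths = sc.textFile(catalog).collect().mkString(",")
    val data = sc.sequenceFile(paths, classOf[Text], classOf[BytesWritable])

    println(data.count())
    sc.stop()
  }
}
```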
PS:
Because there are so many file paths (more than 100,000, so the catalog file itself is large), the job still hit errors. Exhausting!
I had no choice but to split the catalog file again into ten files of ten thousand paths each.
That works, but it feels far too inconvenient: new data comes in every day, so a fresh set of catalog files has to be generated before every run.
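One way to avoid maintaining ten catalog files by hand is to split the path list in code and union the resulting RDDs; a sketch under the same assumptions as above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

object LoadInChunks {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-in-chunks"))

    // Collect the full list of non-empty file paths to the driver (illustrative catalog file).
    val allPaths = sc.textFile("/data/catalog/non_empty_paths.txt").collect()

    // Load ten thousand paths at a time, then union the pieces into one RDD.
    val rdds = allPaths.grouped(10000).toSeq
      .map(chunk => sc.sequenceFile(chunk.mkString(","), classOf[Text], classOf[BytesWritable]))
    val all = sc.union(rdds)

    println(all.count())
    sc.stop()
  }
}
```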
CodePudding user response:
Because this approach has to read binary files, I couldn't do it with textFile().
CodePudding user response:
Before each run, check the file size information under the specified directory, filter out the empty files, and return a list of the remaining paths; then have Spark load those paths. It's just a matter of defining one extra method.
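A sketch of that approach using the Hadoop FileSystem API to skip zero-length files before loading (the directory and key/value classes are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, Text}

object FilterEmptyFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-empty-files"))

    // List the input directory and keep only files whose length is greater than zero.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val nonEmpty = fs.listStatus(new Path("/data/input"))
      .filter(status => status.isFile && status.getLen > 0)
      .map(_.getPath.toString)

    // Hand sequenceFile() the surviving paths as a comma-separated string.
    val data = sc.sequenceFile(nonEmpty.mkString(","), classOf[Text], classOf[BytesWritable])

    println(data.count())
    sc.stop()
  }
}
```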
CodePudding user response:
Qiongwei