Spark: how to write the tuples in an RDD to HDFS in a specified format?

Time:09-20

Question: after data cleaning in Spark I have an RDD of type RDD[(String, String)]. Each element is

a tuple whose key is a file name and whose value is that file's content. I want to persist the whole RDD to HDFS so that each

element is saved as a separate file, with the key as the file name and the value as the file content.

How should I do this?

An RDD doesn't seem to support direct traversal on the driver; I can only turn it into an array with collect() and iterate over that, but that might blow up the driver's memory.

My current approach is to save the RDD to HDFS with saveAsTextFile, then read the resulting

part files back with an FSDataInputStream and rewrite them to HDFS through an output stream, but this is very time-consuming.

Is there a better way to write the contents of the RDD to HDFS directly?

CodePudding user response:

Do you want each element stored as its own file? Or do you want all elements that share the same key written to a single file named after that key, with the values as its content?

CodePudding user response:

Use the foreach operator with HDFS file streams instead of saveAsTextFile.
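A minimal sketch of that suggestion, assuming Scala Spark and a hypothetical output directory `/output`; the file names, sample data, and directory are illustrative, not from the original post. The write happens inside foreachPartition on the executors, so nothing is collected to the driver:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.nio.charset.StandardCharsets

object SaveEachPairAsFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-pairs").getOrCreate()

    // Hypothetical RDD[(String, String)]: key = file name, value = file content.
    val rdd = spark.sparkContext.parallelize(
      Seq(("a.txt", "hello"), ("b.txt", "world")))

    rdd.foreachPartition { iter =>
      // Open one FileSystem handle per partition; this closure runs on the executors,
      // so the default FS comes from the cluster's Hadoop configuration.
      val fs = FileSystem.get(new Configuration())
      iter.foreach { case (name, content) =>
        // One HDFS file per tuple: key as the file name, value as the content.
        val out = fs.create(new Path("/output/" + name))
        try out.write(content.getBytes(StandardCharsets.UTF_8))
        finally out.close()
      }
    }
    spark.stop()
  }
}
```

Note that if two tuples share the same key, the later write overwrites the earlier file; if that can happen, either make the keys unique first or group by key before writing.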