import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val sc: SparkContext = ...

// Load the documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

import org.apache.spark.mllib.feature.IDF

// ... continue from the previous example
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
The final result is an RDD of Vector. Vector is an abstract class; in practice the subclass returned here is SparseVector, which has three fields: size, indices, and values (an Array of Double). Each value is the TF-IDF weight of a term in the document. However, when I want to find out which word a given value corresponds to, I have no idea where to start. Does anyone know how to map an index back to its word?
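For reference, a minimal sketch (not part of the original post; it assumes the tfidf RDD from the code above and a local run where println output is visible) of reading those three SparseVector fields:

import org.apache.spark.mllib.linalg.SparseVector

tfidf.foreach { v =>
  val sv = v.asInstanceOf[SparseVector]
  // sv.indices(i) is the hashed bucket of a term, sv.values(i) its TF-IDF weight
  sv.indices.zip(sv.values).foreach { case (idx, weight) =>
    println(s"index=$idx tfidf=$weight")
  }
}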
CodePudding user response:
Not many people around here; you could try asking @cloud881001.

CodePudding user response:
Hello, did you solve your problem? I'd like to ask how you mapped the values back to their words.

CodePudding user response:
Hi, did you solve it? Could you share the solution?

CodePudding user response:
Hello,
http://stackoverflow.com/questions/35205865/what-is-the-difference-between-hashingtf-and-countvectorizer-in-spark
HashingTF is irreversible; I also haven't found a way to reverse CountVectorizer. Did you ever solve it?
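One commonly suggested workaround (a sketch under assumptions, not from this thread: it reuses the documents RDD and hashingTF from the question, relies on HashingTF.indexOf, and assumes the distinct vocabulary fits in driver memory) is to hash every distinct term yourself and keep the resulting index-to-term map; hash collisions can map several terms to one index, which is the irreversibility mentioned above:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.rdd.RDD

// documents: RDD[Seq[String]] and hashingTF as in the question
val indexToTerms: Map[Int, Seq[String]] =
  documents
    .flatMap(identity)                              // every term
    .distinct()
    .map(term => (hashingTF.indexOf(term), term))   // same hash HashingTF applies
    .groupByKey()
    .mapValues(_.toSeq)
    .collect()                                      // vocabulary must fit on the driver
    .toMap

// Look up the term(s) behind each nonzero position of a SparseVector:
// sv.indices.map(i => indexToTerms.getOrElse(i, Seq("<unknown>")))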
CodePudding user response:
// idfvector: a JavaRDD<Vector> holding the TF-IDF vectors
idfvector.foreach(new VoidFunction<Vector>() {

    private static final long serialVersionUID = 1L;

    @Override
    public void call(Vector t) throws Exception {
        // Cast the abstract Vector down to SparseVector to reach its values
        SparseVector ss = (SparseVector) t;
        double[] aa = ss.values();
        System.out.println("idf-" + t + "-st-" + aa[2]);
    }
});
It's written in Java; just cast the Vector down to SparseVector as shown.