Performance issues with Spark MLlib FPM algorithms

I have only recently started working with Spark for big data processing and have been running some data mining experiments. One of the comparison algorithms is FP-Growth, the classic data mining algorithm, taken from the Spark library. My cluster has three machines: 1 master (2 cores, 4 GB memory) and 2 workers (4 cores, 8 GB memory). The experimental data set is T40I10D100K.txt, generated with the IBM data generator, and its size is 14.76 MB. Because items repeat relatively rarely in this data set, I set the minimum support to 1% and the number of partitions to 16. During mining, a stack overflow error occurred, presumably because FP-Growth builds conditional FP-trees recursively. My question is this: the two workers together have about 12 GB of available memory, so why can the cluster not handle 15 MB of data? Is it because the support is too low, or does frequent itemset mining with FP-Growth inevitably consume this much memory?
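A rough sketch of why the support threshold can matter more than the input size: every non-empty subset of a frequent itemset is itself frequent, so one frequent pattern of length k implies 2^k - 1 frequent itemsets. The snippet below only does this arithmetic; it is not a measurement of the job described above. It assumes the usual reading of the generator's naming convention, where T40I10D100K means an average transaction length of 40, an average pattern length of 10, and 100K transactions. The object name `ItemsetBlowup` and the other pattern lengths are hypothetical, chosen for comparison.

```scala
// Back-of-the-envelope sketch (illustrative only, not a measurement).
object ItemsetBlowup {
  // Number of non-empty subsets of a single k-item frequent pattern.
  def impliedItemsets(k: Int): BigInt = BigInt(2).pow(k) - 1

  def main(args: Array[String]): Unit = {
    // 10 is the assumed average pattern length taken from the data set name;
    // 5, 20 and 40 are hypothetical values for comparison.
    for (k <- Seq(5, 10, 20, 40)) {
      println(s"pattern length $k -> ${impliedItemsets(k)} frequent itemsets")
    }
  }
}
```

For k = 10 this gives 1023 itemsets per pattern, and for k = 20 it gives 1048575, so the output of mining can be far larger than a 15 MB input. Note also that a stack overflow concerns the depth of a thread's call stack, which is a separate limit from the heap memory available on the workers.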

I would appreciate comments from anyone with experience. Thank you.

CodePudding user response:

This is the source code that calls FP-Growth:

import java.io.{File, PrintWriter}

import scala.collection.mutable

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

object PFP {
  def main(args: Array[String]): Unit = {
    // Data set file name -> list of minimum support thresholds to test.
    // Only one entry is active at a time; the others are kept as comments.
    val texts = mutable.Map(
      // "T25I10D10K.txt" -> List(0.005, 0.004, 0.003, 0.002),
      // "mushroom.txt" -> List(0.01))
      // "chess.txt" -> List(0.4))
      // "accidents.txt" -> List(0.1))
      // "T10I4D100K.txt" -> List(0.005, 0.004, 0.003, 0.002, 0.001),
      // "T40I10D100K.txt" -> List(0.01))
      "connect-4.txt" -> List(0.3))
      // "kddcup99.txt" -> List(0.0001, 0.00009, 0.00008, 0.00007, 0.00006))
      // "USCensus.txt" -> List(0.5))
      // "connect-4.txt" -> List(0.5))

    val conf = new SparkConf().setAppName("PFP_scala")
    val sc = new SparkContext(conf)

    texts.foreach { text =>
      // One result file per data set, named after the data set.
      val writer = new PrintWriter(new File("/root/app/scala2.10/PFP/" + text._1))
      val data = sc.textFile("/usr/local/eclipsews/" + text._1)
      val startTime = System.currentTimeMillis()
      // Each line is one transaction; items are separated by spaces.
      val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
      // Note: textFile and map are lazy, so this only times the RDD definition,
      // not the actual file read.
      val ioTime = System.currentTimeMillis() - startTime

      text._2.foreach { support =>
        // Runs once; widen the range to repeat each measurement.
        for (i <- 0 to 0) {

          val time1 = System.currentTimeMillis()
          val fpg = new FPGrowth()
            .setMinSupport(support)
            .setNumPartitions(16)

          val model = fpg.run(transactions)
          // Helper script run before saving the results.
          val process = java.lang.Runtime.getRuntime.exec("/root/app/scala2.10/PFP/checkHDFS.sh")
          process.waitFor()
          model.freqItemsets.saveAsTextFile("/usr/local/PFP")

          val endTime = System.currentTimeMillis()
          val mineTime = endTime - time1
          // hehe.foreach { itemset => println(itemset.items.mkString("/", ", ", "") + ", " + itemset.freq) }
          val time = mineTime + ioTime
          writer.write("database: " + text._1 + " support: " + support + " ioTime: " + ioTime + " mineTime: " + mineTime + " time: " + time + "\n")
        }
      }
      writer.close()
    }