Big data programming and algorithms: a very confused beginner asks for help with MapReduce and Spark

Time:09-23

I just started studying big data today and am writing a program around an algorithm, but there is a lot I don't understand. A question came up today, so I'm asking here. Thanks in advance!

Suppose there are 80 kinds of equipment, 10 fault types, and 50,000 kinds of parts, with the data stored in a database. I want to mine association rules (using FP-growth, for example): whenever a given equipment's fault type occurs together with a certain part more than N times (say N > 600), record that combination.
This could be done by looping SELECT COUNT(*) ... WHERE queries against the database, but that loop would run 80 * 10 * 50000 times.
And as the dimensions and row counts grow, both processing time and database I/O become a problem.
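As a sketch of just the counting step (not the full FP-growth mining): instead of issuing one COUNT query per (equipment, fault, part) combination, the rows can be pulled once and aggregated in a single pass. The `Record` type and field names below are invented for illustration:

```scala
// Hypothetical row type: one record per observed (equipment, fault, part) occurrence.
case class Record(equipment: String, fault: String, part: String)

// One pass over the data replaces the 80 * 10 * 50000 per-combination queries:
// group identical triples, count them, and keep those above the threshold N.
def frequentTriples(records: Seq[Record], threshold: Int): Map[(String, String, String), Int] =
  records
    .groupBy(r => (r.equipment, r.fault, r.part))
    .map { case (key, rows) => (key, rows.size) }
    .filter { case (_, n) => n > threshold }

// Tiny demo, with threshold 1 standing in for N > 600.
val demo = Seq(
  Record("pump", "overheat", "bearing"),
  Record("pump", "overheat", "bearing"),
  Record("pump", "overheat", "seal")
)
println(frequentTriples(demo, 1)) // only the triple seen twice survives
```

In-memory Scala only scales so far, but the same group-and-count shape is what a single GROUP BY ... HAVING COUNT(*) > N query, or a MapReduce/Spark job, would express over larger data.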

So I want to ask: how should this kind of algorithm be done?
1. How should the data in the database be handled? If it has to be exported to a file for the map stage, how?
2. How would this be implemented with MapReduce?
3. How would it be done with Spark? (Spark MLlib seems to have an FP-growth algorithm.)
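For question 1, one hedged sketch: MLlib's FP-growth reads one transaction per line with items separated by spaces (as in sample_fpgrowth.txt), so exporting the database could mean writing one such line per incident. The `Incident` type and the `eq=`/`fault=`/`part=` prefixes below are made up, just to keep the three dimensions distinguishable as items:

```scala
// Hypothetical incident: one equipment/fault event and the parts involved in it.
case class Incident(equipment: String, fault: String, parts: Seq[String])

// Render one incident as one FP-growth transaction line (space-separated items).
// Prefixing items keeps equipment and fault values distinguishable from part IDs.
def toTransactionLine(i: Incident): String =
  (Seq(s"eq=${i.equipment}", s"fault=${i.fault}") ++ i.parts.map(p => s"part=$p"))
    .mkString(" ")

val line = toTransactionLine(Incident("E01", "F03", Seq("P100", "P205")))
println(line) // eq=E01 fault=F03 part=P100 part=P205
```

Writing these lines to a text file (or an HDFS path) gives both a MapReduce job and sc.textFile something to read.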

This is where I, a big data beginner, am very confused. I hope any passing experts can point the way; details would be much appreciated. Thank you.

CodePudding user response:

https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html

CodePudding user response:

Quoting java8964's reply on the 1st floor:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html


A follow-up question: I put fewer than 10 lines of data in a text file and ran the algorithm, and it still feels very slow; it took seven or eight seconds from launch to results. I thought Spark was supposed to be fast. My code is basically like the following:
val data = sc.textFile("data/mllib/sample_fpgrowth.txt")

val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

The cluster has one master with 4 GB of RAM and two workers with 2 GB each.
Spark itself is configured with 1 GB of memory.

I don't know why it is so slow; it feels like an FP-growth implementation written in plain Java, without any cluster, would run faster than this.
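On the slowness: with under 10 input lines, most of those seven or eight seconds are plausibly JVM/SparkContext startup and task scheduling rather than FP-growth itself (and setNumPartitions(10) is a lot of partitions for 10 lines of input). A small timing helper, as a sketch, can separate the two by wrapping only the fpg.run(transactions) call:

```scala
// Measure wall-clock time of a block in milliseconds.
// Wrapping only the algorithm call (e.g. fpg.run(transactions)) separates
// computation time from cluster and JVM startup time.
def timed[T](body: => T): (T, Long) = {
  val t0 = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - t0) / 1000000L)
}

// Demo with a stand-in workload.
val (sum, ms) = timed { (1 to 1000).sum }
println(s"sum=$sum took $ms ms")
```

If the timed portion is small while the total run stays at several seconds, the overhead is in startup, not in the algorithm.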