I've run into a problem porting something from Hadoop to Spark. Hadoop has a tool called Dedoop, which is suited to data deduplication and record linkage. Its principle is the sorted-neighborhood method: sort the data set, then traverse the whole set with a window of size k, computing the pairwise similarity of the records inside the window. Implementing this on Hadoop is very convenient: each reduce group just needs one traversal, and then the head of each reducer's output is pulled out and run through another reduce pass to cover the group boundaries. But on Spark I've hit a wall with this window traversal.
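To make the window step concrete, the pair generation over one sorted sequence is roughly the following (a minimal Scala sketch with my own names; windowedPairs is illustrative, not Dedoop's actual API):

    // Sorted-neighborhood pair generation: each record is compared with the
    // records that follow it within a window of size k.
    def windowedPairs[T](sorted: IndexedSeq[T], k: Int): Iterator[(T, T)] =
      sorted.indices.iterator.flatMap { i =>
        ((i + 1) until math.min(i + k, sorted.length)).iterator
          .map(j => (sorted(i), sorted(j)))
      }

Every pair of records at distance less than k in the sorted order comes out exactly once, and the similarity computation then runs over those pairs.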
Since Spark operates on the RDD as a whole, I figured this window algorithm would be easy to implement. But I've only just started with Spark, and after checking a lot of material I still haven't found a suitable approach. Time is pressing, so I can only turn to everyone here.
What I'm doing now: I sort the data by key and then traverse it with a window directly in a foreach. That way the data sitting on each machine is at least guaranteed to be compared. But the seams between machines are hard to handle: the records at the tail of machine i need to be compared against the records at the head of machine i+1, and I haven't found a good way to do that on Spark.
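For reference, my current attempt has roughly this shape (a sketch with placeholder types; I sort plain strings here just to keep it self-contained, and I use mapPartitions instead of foreach so the pairs come back as an RDD, but the boundary problem is the same):

    import org.apache.spark.rdd.RDD

    // Sort the RDD globally, then slide the window inside each partition.
    // Pairs that straddle a partition boundary (tail of partition i vs. head
    // of partition i+1) are exactly what this misses.
    def perPartitionPairs(records: RDD[String], k: Int): RDD[(String, String)] = {
      val sorted = records.sortBy(identity)       // global sort by the match key
      sorted.mapPartitions { it =>
        val part = it.toIndexedSeq                // materialize one partition
        part.indices.iterator.flatMap { i =>
          ((i + 1) until math.min(i + k, part.length)).iterator
            .map(j => (part(i), part(j)))         // window pairs within this partition only
        }
      }
    }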
Hoping for your guidance! Thanks very much!