Spark RDD statistical operations-CodePudding

The data set has two columns: website and request IP.
For example:
www.Abaidu.com 192.168.1.101
www.Abaidu.com 192.168.1.102
www.Abaidu.com 192.168.1.103
www.Ataobao.com 192.168.1.101
www.Ataobao.com 192.168.1.102
www.Ajd.com 192.168.1.101
The final result I want is:
Site pair, number of coinciding IPs
www.Abaidu.com-www.Ataobao.com 2
www.Abaidu.com-www.Ajd.com 1
www.Ataobao.com-www.Ajd.com 1
How should I do this with Spark RDDs?
I'm a beginner, so a general approach would help.

CodePudding user response:

Try this in spark-shell:
val array = Array(
  Array("www.Abaidu.com", "192.168.1.101"),
  Array("www.Abaidu.com", "192.168.1.102"),
  Array("www.Abaidu.com", "192.168.1.103"),
  Array("www.Ataobao.com", "192.168.1.101"),
  Array("www.Ataobao.com", "192.168.1.102"),
  Array("www.Ajd.com", "192.168.1.101"))

val rows = sc.parallelize(array)

// Pair every row with every other row. Keep a pair only when the sites
// differ and the IP matches; mark everything else as "nothing".
rows.cartesian(rows).map { r =>
  if (r._1(0) != r._2(0) && r._1(1) == r._2(1)) {
    (r._1(0) + "-" + r._2(0), 1)
  } else {
    ("nothing", 1)
  }
}.filter(r => r._1 != "nothing").reduceByKey((a, b) => a + b).map { r =>
  // "a-b" and "b-a" both occur with the same count; sorting the two names
  // makes the rows identical so that distinct() keeps one per pair.
  val sites = r._1.split("-").sortWith(_ > _)
  (sites(0), sites(1), r._2)
}.distinct().collect()
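
The cartesian product compares every row with every other row, which grows quadratically with the size of the data set. A possible alternative, not part of the answer above, is to key the rows by IP and self-join, so that only rows sharing an IP are ever paired. The sketch below is a minimal standalone version under a few assumptions: the names `SharedIpCounts`, `sharedIpCounts` and `sampleRows` are made up for illustration, the master is set to `local[*]`, and duplicate (site, IP) rows should count once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedIpCounts {
  // Sample data from the question (hypothetical holder, for illustration only).
  val sampleRows: Seq[(String, String)] = Seq(
    ("www.Abaidu.com", "192.168.1.101"),
    ("www.Abaidu.com", "192.168.1.102"),
    ("www.Abaidu.com", "192.168.1.103"),
    ("www.Ataobao.com", "192.168.1.101"),
    ("www.Ataobao.com", "192.168.1.102"),
    ("www.Ajd.com", "192.168.1.101"))

  // For each unordered pair of sites, count the IPs that requested both.
  def sharedIpCounts(sc: SparkContext,
                     rows: Seq[(String, String)]): Array[((String, String), Int)] = {
    // Key by IP so the self-join only pairs rows that share an IP.
    // distinct() is an assumption: repeated (site, IP) rows count once.
    val byIp = sc.parallelize(rows).distinct().map { case (site, ip) => (ip, site) }
    byIp.join(byIp)
      .values
      // a < b drops self-pairs and keeps one ordering of each pair.
      .filter { case (a, b) => a < b }
      .map(pair => (pair, 1))
      .reduceByKey(_ + _)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shared-ip-counts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      sharedIpCounts(sc, sampleRows).sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell the existing `sc` can be passed directly, for example `SharedIpCounts.sharedIpCounts(sc, SharedIpCounts.sampleRows)`. Keeping the pair as a tuple key rather than a "-"-joined string also avoids splitting problems if a site name ever contains a hyphen.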