Spark RDD flatmap performance is slow, for help

Time:09-23

When using the RDD flatMap function, if the function returns a large number of objects per input record (dozens to hundreds), Spark can run extremely slowly. Sample code:
import scala.collection.mutable.ArrayBuffer

val rdd2 = sc.textFile(strInputFilePath).flatMap(line => {
  val grids = new ArrayBuffer[Tuple2[Tuple2[Int, Int], Double]]()
  val values = line.split(",")
  if (values.length > 3) {
    try {
      val br = broadcast.value
      val x = values(br._6).toDouble
      val y = values(br._7).toDouble

      val point = new SPoint2D(x, y)

      if (br._1.contains(point)) {
        val xmin = point.x - br._2
        val ymin = point.y - br._2
        val xmax = point.x + br._2
        val ymax = point.y + br._2

        val rcBounds = br._1
        val indexCols = br._4
        val indexRows = br._5

        var col1 = Math.floor((xmin - rcBounds.getLeft()) / resolution).toInt
        var col2 = Math.floor((xmax - rcBounds.getLeft()) / resolution).toInt
        var row1 = -Math.floor((ymax - rcBounds.getTop()) / resolution).toInt
        var row2 = -Math.floor((ymin - rcBounds.getTop()) / resolution).toInt

        // Clamp the cell indices to the grid bounds
        if (col1 < 0) {
          col1 = 0
        } else if (col1 >= indexCols) {
          col1 = indexCols - 1
        }
        if (col2 < 0) {
          col2 = 0
        } else if (col2 >= indexCols) {
          col2 = indexCols - 1
        }
        if (row1 < 0) {
          row1 = 0
        } else if (row1 >= indexRows) {
          row1 = indexRows - 1
        }
        if (row2 < 0) {
          row2 = 0
        } else if (row2 >= indexRows) {
          row2 = indexRows - 1
        }

        // Ensure the ranges run low-to-high
        if (col1 > col2) {
          val temp = col2
          col2 = col1
          col1 = temp
        }
        if (row1 > row2) {
          val temp = row2
          row2 = row1
          row1 = temp
        }

        for (col <- col1 to col2; row <- row1 to row2) {
          val xtemp = rcBounds.getLeft() + col * br._3 + 0.5 * br._3
          val ytemp = rcBounds.getTop() - row * br._3 - 0.5 * br._3

          val dist = Math.sqrt((x - xtemp) * (x - xtemp)
            + (y - ytemp) * (y - ytemp))
          if (dist <= br._2) {
            val disPre = dist / br._2
            val valuePre = 1.0 * 3 * Math.pow(1 - disPre * disPre, 2) / br._8
            grids += new Tuple2((row, col), valuePre)
          }
        }
      }
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
    }
  }
  grids.toArray
})
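One general improvement worth trying (a minimal sketch, not the poster's actual code): `flatMap` accepts any `TraversableOnce`, so the per-line function can return an `Iterator` instead of building an `ArrayBuffer` and copying it with `toArray`. That avoids allocating and materializing a large intermediate collection for every input line. The `expand` function below is a hypothetical stand-in for the grid expansion.

```scala
object IteratorFlatMapSketch {
  // Hypothetical stand-in for the per-line expansion: each input value
  // lazily expands to several ((row, col), value) pairs. No intermediate
  // buffer is allocated; pairs are produced as the consumer pulls them.
  def expand(n: Int): Iterator[((Int, Int), Double)] =
    (0 until 3).iterator.map(i => ((n, i), n * 0.5 + i))

  def main(args: Array[String]): Unit = {
    // Same idea outside Spark: flatMap over iterators, materialize once.
    val lines = Seq(1, 2, 3)
    val result = lines.iterator.flatMap(expand).toList
    println(result.size)
  }
}
```

With an RDD the shape is identical: `rdd.flatMap(line => expand(...))` where `expand` returns an `Iterator`, letting Spark consume the results lazily.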

In the code above, when the grids ArrayBuffer ends up holding many objects, the job runs very slowly. Does anyone have a good solution or approach?

CodePudding user response:

Has your problem been solved?

CodePudding user response:

I don't think Spark is slow; your code is slow.

You should time your method (the whole line => { your implementation here }) to see how long it runs. If it takes 1 s, then assuming your text file has 1,000,000 lines, it would run for 1,000,000 seconds sequentially. Even if Spark runs 1000 of those concurrently, your whole job will still take 1000 seconds.
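That timing advice can be sketched as follows (`timeIt` and `perLine` are made-up names; `perLine` stands in for the real per-line function):

```scala
object TimingSketch {
  // Hypothetical placeholder for the real per-line work.
  def perLine(line: String): Array[((Int, Int), Double)] =
    line.split(",").zipWithIndex.map { case (s, i) => ((i, i), s.length.toDouble) }

  // Run a block once and report the wall-clock time in milliseconds.
  def timeIt[T](body: => T): (T, Double) = {
    val t0 = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - t0) / 1e6
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    val (out, ms) = timeIt(perLine("1,2,3,4"))
    println(f"produced ${out.length} pairs in $ms%.3f ms")
  }
}
```

Multiplying the per-line time by the line count, then dividing by the cluster's parallelism, gives a rough lower bound on job duration before any Spark-level tuning.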

So my suggestions are:
1) Benchmark how long your method runs, and try to improve it. I didn't read it carefully, but it looks like it can be improved; the code smells bad.
2) If there is no way to improve the method itself, note that all it needs from each line is:
val x = values(br._6).toDouble
val y = values(br._7).toDouble
So you can first take the distinct combinations of (values(br._6), values(br._7)), so that each unique combination runs through your method only once, rather than once per line as it does now.
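That de-duplication idea can be sketched with plain collections standing in for an RDD (`key` and `expensive` are hypothetical names; the column indices in `key` are assumptions): extract the (x, y) pair from each line, run the expensive expansion once per unique pair, then look the cached results up for every line.

```scala
object DistinctSketch {
  // Extract the coordinate pair a line contributes (indices are illustrative).
  def key(line: String): (Double, Double) = {
    val v = line.split(",")
    (v(0).toDouble, v(1).toDouble)
  }

  // Stand-in for the expensive per-point grid expansion.
  def expensive(p: (Double, Double)): Seq[((Int, Int), Double)] =
    Seq(((p._1.toInt, p._2.toInt), p._1 + p._2))

  def main(args: Array[String]): Unit = {
    val lines = Seq("1,2,a", "1,2,b", "3,4,c") // two distinct coordinate pairs
    // Run the expensive step once per distinct key, not once per line.
    val cache = lines.map(key).distinct.map(p => p -> expensive(p)).toMap
    val result = lines.flatMap(l => cache(key(l)))
    println(result.size)
  }
}
```

In Spark the same shape would be `rdd.map(key).distinct().flatMap(expensive)`, optionally joined back to the original lines if per-line output is still needed.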