Finding closest pair of points in the plane with non-distinct x-coordinates in O(nlogn)

Most of the implementations of the algorithm to find the closest pair of points in the plane that I've seen online have one of two deficiencies: either they fail to meet an O(nlogn) runtime, or they fail to accommodate the case where some points share an x-coordinate. Is a hash map (or equivalent) required to solve this problem optimally?

Roughly, the algorithm in question is (per CLRS Ch. 33.4):

1. For an array of points P, create additional arrays X and Y such that X contains all points in P sorted by x-coordinate, and Y contains all points in P sorted by y-coordinate.
2. Divide the points in half: drop a vertical line so that you split X into two arrays, X_L and X_R, and divide Y similarly, so that Y_L contains all points left of the line and Y_R contains all points right of the line, both sorted by y-coordinate.
3. Make recursive calls for each half, passing X_L and Y_L to one and X_R and Y_R to the other, and let d be the smaller of the two minimum distances they return.
4. Lastly, determine if there's a pair with one point on the left and one point on the right of the dividing line with distance smaller than d. Through a geometric argument, we find that for every point within distance d of the dividing line, it suffices to check only the next 7 such points in y-sorted order, meaning the recombination of the divided subproblems is only an O(n) step (even if it looks like O(n²) at first glance).
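As a rough illustration of that last step, here is a minimal Python sketch of the strip scan. The function name strip_closest and the representation of points as (x, y) tuples are my own assumptions, not anything from CLRS; it expects the strip to be sorted by y-coordinate already.

```python
import math

def strip_closest(strip_by_y, d):
    """Return the smallest distance found in the strip, or d if none beats it.

    strip_by_y: (x, y) tuples within distance d of the dividing line,
    already sorted by y-coordinate (hypothetical input format).
    """
    best = d
    n = len(strip_by_y)
    for i in range(n):
        # Compare each point only with the next 7 points in y-order,
        # so the inner loop does a constant amount of work.
        for j in range(i + 1, min(i + 8, n)):
            best = min(best, math.dist(strip_by_y[i], strip_by_y[j]))
    return best
```

The inner loop runs at most 7 times per point, which is where the O(n) bound for the recombination comes from.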

This has some tricky edge cases. One way people deal with this is sorting the strip of points of distance d from the dividing line at every recombination step (e.g. here), but this is known to result in an O(nlog²n) solution.

Another way people deal with edge cases is by assuming each point has a distinct x-coordinate (e.g. here): note the snippet in closestUtil which adds to Pyl (or Y_L as we call it) if the x-coordinate of a point in Y is <= the line, or to Pyr (Y_R) otherwise. Note that if all points lie on the same vertical line, this would result in us writing past the end of the array in C++, as we write all n points to Y_L.
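To make the failure concrete, here is a small Python sketch of that comparison-based split. The name naive_split is made up for illustration and this is not the linked code, just the same idea with points as (x, y) tuples.

```python
def naive_split(Y, mid_x):
    """Partition Y by comparing each x-coordinate to the dividing line."""
    Y_L = [p for p in Y if p[0] <= mid_x]
    Y_R = [p for p in Y if p[0] > mid_x]
    return Y_L, Y_R

# Four points on the same vertical line x = 0.
Y = [(0, 0), (0, 1), (0, 2), (0, 3)]
Y_L, Y_R = naive_split(Y, mid_x=0)
print(len(Y_L), len(Y_R))  # prints "4 0", although X_L holds only 2 points
```

With fixed-size arrays sized for n/2 points each, the same logic would overflow the left array.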

So the tricky bit when points can have the same x-coordinate is dividing the points in Y into Y_L and Y_R depending on whether a point p in Y is in X_L or X_R. The pseudocode in CLRS for this is (edited slightly for brevity):

for i = 1 to Y.length
    if Y[i] in X_L
        Y_L.length = Y_L.length + 1
        Y_L[Y_L.length] = Y[i]
    else Y_R.length = Y_R.length + 1
        Y_R[Y_R.length] = Y[i]

However, outside of pseudocode, if we're working with plain arrays, we don't have a magic function that can determine whether Y[i] is in X_L in O(1) time. If we're assured that all x-coordinates are distinct, sure: we know that anything with an x-coordinate less than the dividing line is in X_L, so with one comparison we know which array to put any point p in Y into. But in the case where x-coordinates are not necessarily distinct (e.g. where all points lie on the same vertical line), do we require a hash map to determine whether a point in Y is in X_L or X_R and successfully break Y down into Y_L and Y_R in O(n) time? Or is there another strategy?
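For reference, the hash-based version I have in mind would look something like the Python sketch below. The name split_with_set is hypothetical, and the sketch assumes all points are distinct, hashable (x, y) tuples; duplicate points would all land on the same side.

```python
def split_with_set(X, Y):
    """Split X at its midpoint, then split Y to match using a hash set.

    X: points sorted by x-coordinate; Y: the same points sorted by y.
    Assumes all points are distinct (x, y) tuples.
    """
    mid = len(X) // 2
    X_L, X_R = X[:mid], X[mid:]
    in_left = set(X_L)  # expected O(1) membership tests
    # A single pass over Y keeps both halves sorted by y-coordinate.
    Y_L = [p for p in Y if p in in_left]
    Y_R = [p for p in Y if p not in in_left]
    return X_L, X_R, Y_L, Y_R
```

This runs in expected O(n) time per split, but it relies on hashing rather than plain comparisons.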

Answer:

Yes, there are at least two approaches that work here.

The first, as Bing Wang suggests, is to apply a rotation. If the angle is sufficiently small, this amounts to breaking ties by y-coordinate after comparing by x-coordinate; no other math is needed.
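A minimal Python sketch of that tie-breaking idea follows. The name split_by_pivot is made up for illustration, points are (x, y) tuples, and the sketch assumes all points are distinct (duplicates can be detected in one pass over the sorted array, in which case the answer is 0).

```python
def split_by_pivot(X, Y):
    """Split X at its midpoint and split Y to match, using comparisons only.

    X: points sorted by (x, y); Y: the same points sorted by y.
    Assumes all points are distinct.
    """
    mid = len(X) // 2
    pivot = X[mid]  # first point of the right half
    X_L, X_R = X[:mid], X[mid:]
    Y_L, Y_R = [], []
    for p in Y:
        # Tuple comparison is lexicographic: x first, then y as tie-breaker.
        if p < pivot:
            Y_L.append(p)
        else:
            Y_R.append(p)
    return X_L, X_R, Y_L, Y_R
```

Because X is sorted by the same (x, y) order, a point is in X_L exactly when it compares less than the pivot, so each membership test is a single comparison and no hash map is involved.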

The second is to adjust the algorithm on G4G to use a linear-time partitioning algorithm to divide the instance, and a linear-time sorted merge to conquer it. Presumably this was not done because the author valued the simplicity of sorting relative to the previously mentioned algorithms in most programming languages.
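Here is one way the second approach could look in Python. This is a sketch under my own assumptions, not the G4G code: the names closest_pair and _solve are made up, points are (x, y) tuples, the divide step is done by index into an array sorted once by (x, y), and each call returns its points sorted by y so the parent can merge them in linear time.

```python
import math

def closest_pair(points):
    """Smallest distance between any two of the given (x, y) tuples."""
    pts = sorted(points)  # by x, ties broken by y
    if len(pts) < 2:
        return math.inf
    return _solve(pts, 0, len(pts))[0]

def _solve(pts, lo, hi):
    """Return (min distance in pts[lo:hi], pts[lo:hi] sorted by y)."""
    if hi - lo == 1:
        return math.inf, [pts[lo]]
    mid = (lo + hi) // 2
    mid_x = pts[mid][0]
    d_left, left = _solve(pts, lo, mid)
    d_right, right = _solve(pts, mid, hi)
    d = min(d_left, d_right)

    # Linear-time merge of the two y-sorted halves (no re-sorting).
    by_y = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][1] <= right[j][1]:
            by_y.append(left[i])
            i += 1
        else:
            by_y.append(right[j])
            j += 1
    by_y.extend(left[i:])
    by_y.extend(right[j:])

    # Scan the strip around the dividing line, 7 neighbours per point.
    strip = [p for p in by_y if abs(p[0] - mid_x) < d]
    for a in range(len(strip)):
        for b in range(a + 1, min(a + 8, len(strip))):
            d = min(d, math.dist(strip[a], strip[b]))
    return d, by_y
```

For example, closest_pair([(0, 0), (0, 3), (0, 7), (0, 8), (0, 20)]) returns 1.0 even though every point shares the same x-coordinate, since the split never compares x-coordinates against the dividing line.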