I want to estimate the probability that a randomly drawn datapoint from array x will be greater than or equal to a randomly drawn point from array y. I want to do this by comparing all possible pairs of values.
I think that a naive implementation would look like this:
def probability_x_gte_y(array_x, array_y):
    gte_counter = 0
    n_comparisons = len(array_x) * len(array_y)
    for vx in array_x:
        for vy in array_y:
            if vx >= vy:
                gte_counter += 1
    return gte_counter / n_comparisons
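As a quick sanity check with made-up arrays, the intended behaviour is:

# 2 of the 4 (vx, vy) pairs satisfy vx >= vy, so this should return 0.5.
print(probability_x_gte_y([1, 5], [2, 3]))  # 0.5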
But I believe that there is a more efficient way to calculate this, especially given that the distributions of the points in array_x and array_y are likely to be quite well separated (put another way, the overlap between the two 1-D arrays is likely to be small relative to the total range of points covered).
CodePudding user response:
A much faster implementation is to sort one of the arrays, so that the number of values greater than or equal to a given one can be found quickly with a binary search. This implementation runs in O(n log n), while the original runs in O(n * n).
import numpy as np

def probability_x_gte_y_opt2(array_x, array_y):
    n_comparisons = len(array_x) * len(array_y)
    sorted_x = np.sort(array_x)
    # searchsorted gives, for each value in array_y, how many values of
    # sorted_x are strictly smaller; the complement counts the >= pairs.
    gte_counter = n_comparisons - np.searchsorted(sorted_x, array_y).sum()
    return gte_counter / n_comparisons
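As a quick check (the array sizes and distributions here are arbitrary), both versions should agree on the same inputs:

rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.5, size=1000)
y = rng.normal(-1.0, 0.5, size=1000)
# Both functions count the same pairs, so the results should match exactly.
assert probability_x_gte_y(x, y) == probability_x_gte_y_opt2(x, y)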
On random arrays of size 5000, this is about 3890 times faster on my machine (2.69s vs 0.69ms)!
Note that it is possible to write an algorithm running in O(n) time: you can use a radix sort on the two arrays, followed by a kind of custom counting merge between the two sorted arrays. However, Numpy does not implement a radix sort yet, and a fast counting merge cannot easily be implemented with Numpy.
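For illustration, here is a minimal pure-Python sketch of the counting-merge step (the function name is mine, and it assumes both arrays have already been sorted, e.g. by a radix sort); it counts the >= pairs in a single linear pass:

def count_gte_pairs(sorted_x, sorted_y):
    # Two-pointer merge over the two sorted arrays: as vx increases,
    # j only ever moves forward, so the pass is O(len(x) + len(y)).
    gte_counter = 0
    j = 0
    for vx in sorted_x:
        # Advance j past every vy <= vx; j is then the number of such vy.
        while j < len(sorted_y) and sorted_y[j] <= vx:
            j += 1
        gte_counter += j
    return gte_counter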
CodePudding user response:
If your arrays are small enough, you are correct that an exhaustive calculation would work. If your arrays are too big, then you could instead run a simulation that converges to the correct probability, giving you an estimate.
If you can't exhaust and want a precise answer, then you need to break the problem up into smaller chunks. For example, suppose all points in X and Y are in the interval (0, 1). If we break that interval up into ten sub-intervals, (0, 0.1), ..., (0.9, 1), then we could exhaust each sub-interval and work through the conditional probabilities. In theory, this could even be reduced to intervals containing only single points, but I am assuming there will be a trade-off between exhaustion size and conditional probability tree size.
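A rough sketch of that bucketing idea, assuming NumPy arrays with a non-degenerate range (the bin count and function name are just for illustration): pairs whose bins differ are decided by the bin order alone, and only same-bin pairs need the exhaustive comparison.

import numpy as np

def probability_x_gte_y_bucketed(array_x, array_y, n_bins=10):
    lo = min(array_x.min(), array_y.min())
    hi = max(array_x.max(), array_y.max())
    # Assign each value to one of n_bins equal-width bins over [lo, hi].
    inner_edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    bx = np.digitize(array_x, inner_edges)
    by = np.digitize(array_y, inner_edges)
    total = 0
    for i in range(n_bins):
        x_in_i = array_x[bx == i]
        # Every x in bin i is >= every y in a strictly lower bin.
        total += len(x_in_i) * np.count_nonzero(by < i)
        # Same-bin pairs still need the exhaustive comparison.
        y_in_i = array_y[by == i]
        if len(x_in_i) and len(y_in_i):
            total += np.count_nonzero(x_in_i[:, None] >= y_in_i[None, :])
    return total / (len(array_x) * len(array_y))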