A better than O(N) solution for searching a vector of sorted intervals-CodePudding

Given a set of sorted intervals (first >= second), sorted by the first element of the interval:

{1, 3}, {1, 2}, {2, 4}, {2, 2}, {2, 3}, {3, 5}, {3, 3}, {3, 7}

is there an efficient algorithm for determining the first interval that intersects a given input interval? For example:

Query ({0, 0}) = returns end()
Query ({2, 4}) = returns iterator to element 0
Query ({3, 8}) = returns iterator to element 0
Query ({4, 9}) = returns iterator to element 2
Query ({7, 8}) = returns iterator to element 7
Query ({8, 9}) = returns end()

By efficient I mean better than O(N). I have a vague feeling there's a lower_bound or upper_bound solution to this problem but I don't have the mental horsepower to work out what it is.

This is the O(N) solution that I'm unsatisfied with.

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>

int main()
{
    using Interval = std::pair<int, int>;
    using Sequence = std::vector<Interval>;
    using Iterator = Sequence::const_iterator;

    auto Query = [](Sequence const & sequence, Interval const & interval) -> Iterator
    {
        return std::find_if(sequence.begin(), sequence.end(), [interval](Interval const & other) {
            return interval.first <= other.second && interval.second >= other.first;
        });
    };

    auto Print = [](Sequence const & sequence, Iterator const & iterator) -> void
    {
        if (iterator == sequence.cend())
        {
            std::cout << "end()\n";
        }
        else
        {
            std::cout << std::to_string(std::distance(sequence.cbegin(), iterator)) << "\n";
        }
    };

    Sequence sequence = {
        {1, 3}, { 1, 2 }, { 2, 4 }, { 2, 2 }, { 2, 3 }, { 3, 5 }, { 3, 3 }, { 3, 7 }
    };

    auto result = Iterator();

    result = Query(sequence, { 0, 0 });

    Print(sequence, result);

    result = Query(sequence, { 2, 4 });

    Print(sequence, result);

    result = Query(sequence, { 3, 8 });

    Print(sequence, result);

    result = Query(sequence, { 4, 9 });

    Print(sequence, result);

    result = Query(sequence, { 7, 8 });

    Print(sequence, result);

    result = Query(sequence, { 8, 9 });

    Print(sequence, result);
}

Output:

end()
0
0
2
7
end()

CodePudding user response：

There cannot be a faster than linear algorithm. Consider the input where the first value of each element is 0 and the second value is random uniform over some range. Consider the query to be {x,x 1} where x is in the same range.

Since you want "the first interval that intersects a given input", the sorting is now useless. You must scan it all.

Hence, you cannot beat O(N).

CodePudding user response：

As said in the comments, with the data structure outlined in the question you cannot do better than linear

You cannot afford to skip any interval whose left bound is lower than the right bound of your requested interval, as the right bound (over which there's no sort/constraint) can always assume a value that may make the examined interval intersect, so you cannot jump immediately to an "interesting zone" using binary search.

although you can have an early exit once you reach the interval whose left bound is greater than the query's right bound.

However, given that OP is ok with changing the data representation, this can be made much easier once you make the intervals non-overlapping; given that we are only interested in an intersects/doesn't intersect test, it's not much of a problem.

Starting with the sorted data, merging overlapping intervals is just a matter of a single linear pass to do once: keep a "current" element, and keep it growing as long as you read overlapping elements; once you get a disjoint one, push the "current" one and make this the new "current". This can be also easily be made to work inplace, as a variant of the usual remove_if algorithm.

Once you have sorted disjoint intervals, you can apply binary search to your problem to have O(log n) queries. Instead of a vector<pair> a more advantageous structure is a plain vector<int> with the (sorted) bounds flattened into a single sequence.

With this representation, a query goes like this:

do a upper_bound looking for the left bound of the query interval;
if you get end, then no bound is greater, so we have no intersection;
otherwise, get the index from the iterator:
- if the index is odd, it means that the first bound greater than the searched value is a right bound; that means that the query left bound falls inside an interval, thus we have an intersection;
- otherwise, the left bound of the query falls in a hole between two intervals, so we have to check the right bound of the query; the (even-indexed) element that was found by upper_bound is the left bound of the next interval: if it's less than the right bound of the query, then we have an intersection, otherwise the query interval falls fully inside the hole, so there's no intersection.