Data Structure to convert a stream of integers into a list of range-CodePudding

A function is given with a method to get the next integer from a stream of integers. The numbers are fetched sequentially from the stream. How will we go about producing a summary of integers encountered till now? Given a list of numbers, the summary will consist of the ranges of numbers. Example: The list till now = [1,5,4,2,7] then summary = [[1-2],[4-5],7]
Put the number in ranges if they are continuous.

My Thoughts:
Approach 1:
Maintain the sorted numbers. So when we fetch a new number from a stream, we can use binary search to find the location of the number in the list and insert the element so that the resulting list is sorted. But since this is a list, I think inserting the element will be an O(N) operation.

Approach 2:
Use Balanced binary search trees like Red, Black, or AVL. Each insertion will be O(log N)
and in order will yield the sorted array from which one can compute the range in O(N)

Approach 2 looks like a better approach if I am not making any mistakes. I am unsure if there is a better way to solve this issue.

CodePudding user response：

I'd not keep the original numbers, but aggregate them to ranges on the fly. This has the potential to reduce the number of elements by quite some factor (depending on the ordering and distribution of the incoming values). The task itself seems to imply that you expect contiguous ranges of integers to appear quite frequently in the input.

Then a newly incoming number can fall into one of a few cases:

It is already contained in some range: then simply ignore the number (this is only relevant if duplicate inputs can happen).
It is adjacent to none of the ranges so far: create a new single-element range.
It is adjacent to exactly one range: extend that range by 1, downward or upward.
It is adjacent to two ranges (i.e. fills the gap): merge the two ranges.

For the data structure holding the ranges, you want a good performance for the following operations:

Find the place (position) for a given number.
Insert a new element (range) at a given place.
Merge two (neighbor) elements. This can be broken down into:
- Remove an element at a given place.
- Modify an element at a given place.

Depending on the expected number und sparsity of ranges, a sorted list of ranges might do. Otherwise, some kind of search tree might turn out helpful.

Anyway, start with the most readable approach, measure performance for typical cases, and decide whether some optimization is necessary.

CodePudding user response：

I suggest maintaining a hashmap that maps each integer seen so far to the interval it belongs to.

Make sure that two numbers that are part of the same interval will point to the same interval object, not to copies; so that if you update an interval to extend it, all numbers can see it.

All operations are O(1), except the operation "merge two intervals" that happens if the stream produces integer x when we have two intervals [a, x - 1] and [x 1, b]. The merge operation is proportional to the length of the shortest of these two intervals.

As a result, for a stream of n integers, the algorithm's complexity is O(n) in the best-case (where at most a few big merges happen) and O(n log n) in the worst-case (when we keep merging lots of intervals).

In python:

def add_element(intervals, x):
        if x in intervals: # do not do anything
            pass
        elif x   1 in intervals and x - 1 in intervals: # merge two intervals
            i = intervals[x - 1]
            j = intervals[x   1]
            if i[1]-i[0] > j[1]-j[0]: # j is shorter: update i, and make everything in j point to i
                i[1] = j[1]
                for y in range(j[0] - 1, j[1] 1):
                    intervals[y] = i
            else: # i is shorter: update j, and make everything in i point to j
                j[0] = i[0]
                for y in range(i[0], i[1]   2):
                    intervals[y] = j
        elif x   1 in intervals: # extend one interval to the left
            i = intervals[x   1]
            i[0] = x
            intervals[x] = i
        elif x - 1 in intervals: # extend one interval to the right
            i = intervals[x - 1]
            i[1] = x
            intervals[x] = i
        else: # add a singleton
            intervals[x] = [x,x]
        return intervals

from random import shuffle
def main():
    stream = list(range(10)) * 2
    shuffle(stream)
    print(stream)
    intervals = {}
    for x in stream:
        intervals = add_element(intervals, x)
        print(x)
        print(set(map(tuple, intervals.values()))) # this line terribly inefficient because I'm lazy

if __name__=='__main__':
    main()

Output:

[1, 5, 8, 3, 9, 6, 7, 9, 3, 0, 6, 5, 8, 1, 4, 7, 2, 2, 0, 4]
1
{(1, 1)}
5
{(1, 1), (5, 5)}
8
{(8, 8), (1, 1), (5, 5)}
3
{(8, 8), (1, 1), (5, 5), (3, 3)}
9
{(8, 9), (1, 1), (5, 5), (3, 3)}
6
{(3, 3), (1, 1), (8, 9), (5, 6)}
7
{(5, 9), (1, 1), (3, 3)}
9
{(5, 9), (1, 1), (3, 3)}
3
{(5, 9), (1, 1), (3, 3)}
0
{(0, 1), (5, 9), (3, 3)}
6
{(0, 1), (5, 9), (3, 3)}
5
{(0, 1), (5, 9), (3, 3)}
8
{(0, 1), (5, 9), (3, 3)}
1
{(0, 1), (5, 9), (3, 3)}
4
{(0, 1), (3, 9)}
7
{(0, 1), (3, 9)}
2
{(0, 9)}
2
{(0, 9)}
0
{(0, 9)}
4
{(0, 9)}