How can I efficiently find distances from each value to the next lower/higher value?-CodePudding

I'll tell you what structures I am using, please feel free to recommend any changes like numpy arrays or something.

Anyways what I have is a list of 5 million sequential entries that correspond to a stock price.

I then have 2 more lists, each of these is the same length - 5 million entries. These lists correspond to an anticipated "upper limit" and anticipated "lower limit" that I expect the stock to reach from that point in the sequence.

What I want to do is go through all 5 million entries in the lower limit list, and record how long in the sequence it takes until the price finally hits that lower limit. I then want to do the same for the upper limit list.

Here is an example of a potential solution for a stock price list with only 10 entries:

prices =       [15,16,18,22,23,17,15,19,15,18]
upper_limits = [17,18,21,23,25,22,18,21,18,20]
lower_limits = [14,15,16,18,19,15,13,17,14,16]


solved_upper = [2,1,1,1,x,x,1,x,1,x]
solved_lower = [x,5,4,2,1,1,x,1,x,x]

#I think I got this right?  Anyways as you can see, the solved lists simply show
#how many entries we have to look at until we find a value that is >= to it for upper, or <= to it
#for lower

So the question is, how to solve this problem reasonably quickly for a huge amount of entries? (And actually, I have 10 upper limit lists and 10 lower limit lists.. so even more efficiency is needed)

Thank you so much for any help!

CodePudding user response：

I am going for clarity over efficiency. Replacing the Dictionary objects with real data objects is probably a good idea.

First we need to turn your timeseries into a searchable tree.

def make_tree (series, i=None, j=None):
    if i is None:
        i = 0
    if j is None:
        j = len(series) - 1

    if i == j:
        return {
            "min_i": i,
            "max_i": i,
            "min_value": series[i],
            "max_value": series[i],
            "left": None,
            "right": None
        }
    else:
        mid = (i   j) // 2
        left = make_tree(series, i, mid)
        right = make_tree(series, mid 1, j)
        return {
            "min_i": i,
            "max_i": j,
            "min_value": min(left['min_value'], right['min_value']),
            "max_value": max(left['max_value'], right['max_value']),
            "left": left,
            "right": right
        }

Next we need functions to search that tree:

def find_next_after_at_least(tree, min_i, min_value):
    if tree['max_i'] <= min_i or tree['max_value'] < min_value:
        return None
    elif tree['min_i'] == tree['max_i']:
        return tree['min_i'] - min_i
    else:
        answer = find_next_after_at_least(tree['left'], min_i, min_value)
        if answer is None:
            answer = find_next_after_at_least(tree['right'], min_i, min_value)
        return answer

def find_next_after_at_most(tree, min_i, max_value):
    if tree['max_i'] <= min_i or max_value < tree['min_value']:
        return None
    elif tree['min_i'] == tree['max_i']:
        return tree['min_i'] - min_i
    else:
        answer = find_next_after_at_most(tree['left'], min_i, max_value)
        if answer is None:
            answer = find_next_after_at_most(tree['right'], min_i, max_value)
        return answer

Now your searches can be written easily:

def solve_upper(tree, limits):
    return [
        find_next_after_at_least(tree, i, limits[i])
            for i in range(len(limits))
    ]

def solve_lower(tree, limits):
    return [
        find_next_after_at_most(tree, i, limits[i])
            for i in range(len(limits))
    ]

And now your example problem:

t = make_tree([15,16,18,22,23,17,15,19,15,18])
print(solve_upper(t, [17,18,21,23,25,22,18,21,18,20]))
print(solve_lower(t, [14,15,16,18,19,15,13,17,14,16]))

CodePudding user response：

You can solve this efficiently (in O(N log N) time) using a data structure similar to what is called a "monotonic queue". You can google that, but the usual use case is pretty different from yours, so I'll just explain. (strangely, this is the 3rd question I've seen here in a week where the answer needs a structure like this.)

In your case, you will work from the end of the prices array toward the start, adding each price to the front of the monotonic queue. Every time you put a price in, some others may be discarded, so that the queue only holds items that are larger than all the previous ones. These are the the only items that could possibly be the 'next higher price'. They are also monotonically increasing in the queue, so you can find the first one >= the target using a binary search. Since you need to know the index of the next higher value, you can store the indexes instead of the values themselves.

That solves the upper limit problem. The lower limit is the similar, but with the queue monotonically decreasing instead.

In python, it looks like this:

def solve_upper(prices, limits):
    solved = [0]*len(prices)
    q = [0]*len(prices)
    qstart = len(q)
    for i in range(len(prices)-1, -1, -1):
        price = prices[i]
        while qstart < len(q) and prices[q[qstart]] <= price:
            # the price at the start of q needs to be discarded, since
            # it isn't greater than the new one
            qstart  = 1
        # prepend the new price
        qstart -= 1
        q[qstart] = i
        limit = limits[i]

        # binary search to find the first price >= limit
        minpos = qstart
        maxpos = len(q)
        while minpos < maxpos:
            testpos = minpos   (maxpos - minpos)//2
            if prices[q[testpos]] < limit:
                # too low
                minpos = testpos 1
            else:
                # high enough
                maxpos = testpos
        if minpos < len(q):
            solved[i] = q[minpos]-i
        else:
            solved[i] = None
    return solved

def solve_lower(prices, limits):
    solved = [0]*len(prices)
    q = [0]*len(prices)
    qstart = len(q)
    for i in range(len(prices)-1, -1, -1):
        price = prices[i]
        while qstart < len(q) and prices[q[qstart]] >= price:
            # the price at the start of q needs to be discarded, since
            # it isn't less than the new one
            qstart  = 1
        # prepend the new price
        qstart -= 1
        q[qstart] = i
        limit = limits[i]

        # binary search to find the first price <= limit
        minpos = qstart
        maxpos = len(q)
        while minpos < maxpos:
            testpos = minpos   (maxpos - minpos)//2
            if prices[q[testpos]] > limit:
                # too low
                minpos = testpos 1
            else:
                # high enough
                maxpos = testpos
        if minpos < len(q):
            solved[i] = q[minpos]-i
        else:
            solved[i] = None
    return solved

prices =       [15,16,18,22,23,17,15,19,15,18]
upper_limits = [17,18,21,23,25,22,18,21,18,20]
lower_limits = [14,15,16,18,19,15,13,17,14,16]
print(solve_upper(prices, upper_limits))
print(solve_lower(prices, lower_limits))

Output:

[2, 1, 1, 1, None, None, 1, None, 1, None]
[None, 5, 4, 2, 1, 1, None, 1, None, None]

NOTE: If you benchmark this answer against @btilly's, please include the results in a comment!