Home > front end >  Find all the times in a (very large) array that have a difference greater than x
Find all the times in a (very large) array that have a difference greater than x

Time:10-15

My array is time, so it is sorted and increasing.

I have to pull out the beginning/end where the difference in the array is greater than 30. The problem which other solutions don't cover, is that the array is thousands of values so looping through the array seems inefficient.

hugeArr = np.array([0, 2.072, 50.0, 90.0, 91.1])

My desired output for the above array would be something like: (2.072,50) (50,90).

Is there a way to accomplish this?

CodePudding user response:

You can use np.diff and np.where to find the correct indices:

>>> idxs = np.where(np.diff(hugeArr) > 30)[0]
>>> list(zip(hugeArr[idxs], hugeArr[idxs   1]))
[(2.072, 50.0), (50.0, 90.0)]

(Assuming you require only consecutive values)

And as @not_speshal mentioned, you can use np.column_stack instead of list(zip(...)) to stay within NumPy boundaries:

>>> np.column_stack((hugeArr[idxs], hugeArr[idxs 1]))
array([[ 2.072, 50.   ],
       [50.   , 90.   ]])

CodePudding user response:

Try to think about what your'e trying to do. For each value in the array if the next value is larger by more then 30 you'd like to save the tuple of them.

The key words here are for each. This is a classic O(n) complexity algorithm, so decreasing its time complexity seems impossible to me.

However, you can make changes specific to your array to make the algorithm faster.

For example, if your'e looking for a difference of 30 and you know that the average difference is 1, you might be better off looking for index i at

difference = hugeArr[i 15] - hugeArr[i]

and see if this is bigger then 30. If it isn't (and it probably won't be), you can skip these 15 indices as you know that no gap between two consecutive values is larger then the big gap.

If this works for you, run tests, 15 is completely arbitrary and maybe your magic number is 25. Change it a bit and time how long your function takes to run.

CodePudding user response:

A strategy that comes to mind is that we don't have to check numbers between two numbers that have a distance smaller than 30, we can do this because it is sorted. For example if the abs(hugeArr[0] - hugeArr[-1]) < 30 we dont have to check anything because nothing will have a distance of over 30.

We would start at the ends and work our way inwards. So check the starting number and ending number first. Then we go halfway hugeArr[len(hugeArr)//2] and check that number distance against the hugeArr[0] and hugeArr[-1]. Then we go into the ranges (hugeArr[0:len(hugeArr)//2] and hugeArr[len(hugeArr)//2:-1]). We break those two ranges again in half and wherever there is a distance from end to end smaller than 30 we don't check those. We can make this a recursive algorithm.

Worst case you'll have a distance over 30 everywhere and end up with O(n) but it could give you some advantage.

Something like this however you might want to refactor to numpy.

def check(arr):
    pairs = []
    
    def check_range(hugeArr):
        difference = abs(hugeArr[0] - hugeArr[-1])

        if difference < 30:
            return

        if len(hugeArr) == 2:
            pairs.append((hugeArr[0], hugeArr[1]))
            return 

        halfway = len(hugeArr)//2

        check_range(hugeArr[:halfway 1])
        check_range(hugeArr[halfway:])
        
    check_range(arr)
    return pairs
  • Related