Find the indices where a sorted list of integers changes


Assuming a sorted list of integers as below:

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
# [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]

I want to find the indices where the value changes, i.e.

[3, 8, 10, 13]

One approach is to use itertools.groupby:

from itertools import groupby

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)  # advance the cursor by the run length
    result.append(cursor)
print(result)

Output

[3, 8, 10, 13]

This approach is O(n). Another possible approach is to use bisect.bisect_left:

from bisect import bisect_left

cursor = 0
result = []
while cursor < len(data):
    # jump straight to the first element larger than the current one
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)

Output

[3, 8, 10, 13]

This approach is O(k*log n), where k is the number of distinct elements. A variant of this approach is to use an exponential search, as sketched below.
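For reference, a minimal sketch of that exponential-search (galloping) variant; the function name change_indices_exp is my own, not from the original post. It doubles the jump size until it overshoots the current run, then bisects within the bracketed range:

from bisect import bisect_left

def change_indices_exp(data):
    result = []
    cursor, n = 0, len(data)
    while cursor < n:
        target = data[cursor] + 1
        # Gallop: double the offset until we jump past the current run.
        bound = 1
        while cursor + bound < n and data[cursor + bound] < target:
            bound *= 2
        # The run ends in (cursor + bound // 2, cursor + bound].
        cursor = bisect_left(data, target, cursor + bound // 2,
                             min(cursor + bound, n))
        result.append(cursor)
    return result

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(change_indices_exp(data))  # [3, 8, 10, 13]

Each run of length m costs O(log m) this way, so the total is O(k*log(n/k)) in the worst case rather than O(k*log n).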

Is there any faster or more performant way of doing this?

CodePudding user response:

When it comes to asymptotic complexity, I think you can improve slightly on the repeated binary search on average by applying a more evenly spread divide-and-conquer approach: first pinpoint the value change that occurs closest to the middle of the input list, thereby splitting the range into approximately two halves, which shortens each subsequent binary search by about one step.

Yet, because this is Python, the gain might not be noticeable due to the Python-level overhead (the yield, yield from, the recursion, ...). It might even perform worse for the list sizes you work with:

from bisect import bisect_left

def locate(data, start, end):
    # Nothing to yield if the range is empty or holds a single value.
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid]
    if val == data[start]:
        # The opening run reaches the middle, so the change nearest the
        # middle is the first element greater than val, after mid.
        start = mid
        val += 1
    # Binary-search the change index closest to the middle...
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)  # ...then recurse into the left half
    yield i
    yield from locate(data, i, end)    # ...and into the right half

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10

Note that this only yields indices that are valid positions in the list, so 13 is not included for this example input.
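If you also want the terminating boundary, as in the expected output in the question, you can simply append the length yourself:

print(*locate(data, 0, len(data)), len(data))  # 3 8 10 13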

CodePudding user response:

I tested the execution time of your approaches on two sets of data and added a third approach using numpy

data1 = [1] * 30000000 + [2] * 30000000 + [4] * 50000000 + [5] * 20000000 + [7] * 40000000 + [9] * 30000000 + [11] * 10000000 + [15] * 30000000
data2 = list(range(10000000))

import time
import numpy as np
from bisect import bisect_left
from itertools import groupby

data = data1  # run once with data1, once with data2

cursor = 0
result = []
start_time = time.time()
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(f'groupby {time.time() - start_time} seconds')

cursor = 0
result = []
start_time = time.time()
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(f'bisect_left {time.time() - start_time} seconds')

data = np.array(data)
start_time = time.time()
result = [i + 1 for i in np.where(data[:-1] != data[1:])[0]] + [len(data)]
print(f'numpy {time.time() - start_time} seconds')

# We need to iterate over the result array to add 1 to each index to match your expected output.
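As a side note (not timed above), that Python-level loop can be avoided by keeping the + 1 inside numpy before converting back to a list:

result = (np.where(data[:-1] != data[1:])[0] + 1).tolist() + [len(data)]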

With data1

groupby 8.864859104156494 seconds
bisect_left 0.0 seconds
numpy 0.27180027961730957 seconds

With data2

groupby 3.602466583251953 seconds
bisect_left 5.440978765487671 seconds
numpy 2.2847368717193604 seconds

As you mentioned, bisect_left depends very much on the number of unique elements, but it seems numpy performs better than itertools.groupby even with the additional iteration over the indices list.
