Fastest method for calculating index of element changes in Python list-CodePudding

I have the following Python list: [1, 1, 2, 2, 2, 3, 4, 4, 4]. I want to create a function that calculates the index of element changes. So, for example, the method would yield this result for the above list - [2, 5, 6] - since the first 2 occurs at index 2, the first 3 at index 5, and the first 4 at index 6.

Of course, there are numerous ways to do this. I'll be running this method millions of times daily over much longer lists, so I'm looking for the quickest solution possible.

Here's the work I've done so far:

idxs = [i for i in range(len(inputList))]
dict_ = {v: i for v, i in zip(inputList, idxs)}
result = [v   1 for v in dict_.values()]

However, there's a bug when inputting lists that have values that have been used previously, such as [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1].

CodePudding user response：

You could enumerate over your values and check if the [i-1] value is different than the [i] value.

>>> inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4]
>>> [i for i, val in enumerate(inputList) if i>0 and inputList[i-1]!= val]
[2, 5, 6]

and with your second example

>>> inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1]
>>> [i for i, val in enumerate(inputList) if i>0 and inputList[i-1]!= val]
[2, 5, 6, 9]

This will run in O(N) time which is the fastest this type of algorithm could possibly execute.

CodePudding user response：

So a little bit of profiling on some largish data suggests reveals some improvements that can be made just by avoiding checking i > 0 each time in the list comprehension. It's comparable to numpy though we're probably losing a lot converting to numpy arrays and back:

import random
import timeit
import numpy as np

max_value = 100
min_run = 100
max_run = 1000
num_runs = 1000

test_list = [1, 1, 2, 2, 2, 3, 4, 4, 4]
test_result = [2, 5, 6]

input_list = sum(([random.randint(0, max_value)] * random.randint(min_run, max_run) for _ in range(num_runs)), [])

def method_1(input_list):
    return [i for i, val in enumerate(input_list) if i>0 and input_list[i-1]!= val]

def method_2(input_list):
    return [i for i in range(1, len(input_list)) if input_list[i-1] != input_list[i]]

def method_3(input_list):
    return [i   1 for i, (a, b) in enumerate(zip(input_list, input_list[1:])) if a != b]

def method_4(input_list):
    input_array = np.array(input_list)
    res, = np.where(input_array[:-1] != input_array[1:])
    res  = 1
    return list(res)

def method_5(input_list):
    return [i   1 for i, val in enumerate(input_list[1:]) if input_list[i]!= val]

assert method_1(test_list) == test_result
assert method_2(test_list) == test_result
assert method_3(test_list) == test_result
assert method_4(test_list) == test_result
assert method_5(test_list) == test_result

print(timeit.timeit("method_1(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_2(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_3(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_4(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_5(input_list)", globals=globals(), number=10))

This yields the result:

0.4418060999996669
0.3605320999995456
0.3416827999972156
0.2726910000019416
0.2845658000005642

CodePudding user response：

Here is an answer using only range and len, without using enumerate, otherwise similar to the answer by Cory Kramer:

inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1]
idxs = [i for i in range(1, len(inputList)) if inputList[i-1] != inputList[i]]
print(idxs)
# [2, 5, 6, 9]

The solutions are similar in run time, but using only range and len, without using enumerate is somewhat faster:

import timeit

t = timeit.Timer("idxs = [i for i in range(1, len(inputList)) if inputList[i-1] != inputList[i]]",
                 "import random; random.seed(42); inputList = [random.randrange(4) for i in range(1000000)]")
print('range   len:', t.timeit(100))

t = timeit.Timer("[i for i, val in enumerate(inputList) if i>0 and inputList[i-1]!= val]",
                 "import random; random.seed(42); inputList = [random.randrange(4) for i in range(1000000)]")
print('enumerate:', t.timeit(100))
# range   len: 15.435243827
# enumerate: 17.243516137