Speed up python looping with numpy-CodePudding

I've tried as much as possible to find a suitable solution/answer on the forum but I'm pulling blanks - most probably due to incorrect terminology.

I currently have the following block of code that performs very poorly when performing the below operations on the array areas[] - the size of the array is upto 150,000 elements.

areas = [x .. x1]
prices = [y .. y1]
for key, value in enumerate(areas):
    compListPrices = [prices[ind] for ind, x in enumerate(areas) if x == value and ind != key]
    # Further operations using compListPrices here

From my understanding, the execution of this code could be improved considerably by leveraging numpy, but I'm struggling to convert it.

Any help would be much appreciated.

CodePudding user response：

Your implementation complexity is O(n²)

I would recommend using zip: O(n) then sort: O(n*log(n)) and a simple loop O(n) resulting in O(n*log(n)) complexity

Even though I'm not using numpy, a faster algorithm is usually faster than just move the logic to numpy

Here is an working example: https://abstra.show/yMFZqiUt8D

CodePudding user response：

You can do what you want in O(n). Like below:

>>> areas  = [1,2,3,4,1,1,3,2]
>>> prices = [100,10,20,50,30,40,20,20]

# zipping areas with thier indexs
>>> lst = list(zip(areas , range(len(areas))))
>>> lst
[(1, 0), (2, 1), (3, 2), (4, 3), (1, 4), (1, 5), (3, 6), (2, 7)]

# save each values exist in each indexs
>>> dct = {}
>>> for l in lst:
....    dct.setdefault(l[0], []).append(l[1])
>>> dct
{1: [0, 4, 5], 2: [1, 7], 3: [2, 6], 4: [3]}

# find prices from their indexs
>>> res = [[] for _ in range(len(areas))]
>>> for k,v in dct.items():
....    for i in v:
....        res[i] = [prices[j] for j in v if i!=j]
>>> res
[[30, 40], [20], [20], [], [100, 40], [100, 30], [20], [10]]

Benchmark on Colab:

from numpy.random import randint
areas = randint(0, 10, 10_000)
prices = randint(0, 1000, 10_000)

def method_Npow2(areas, prices):
    res = []
    for key, value in enumerate(areas):
        res.append([prices[ind] for ind, x in enumerate(areas) if x == value and ind != key])
    return res

def method_Npow1(areas, prices):
    lst = list(zip(areas , range(len(areas))))
    dct = {}
    for l in lst:
        dct.setdefault(l[0], []).append(l[1])

    res = [[] for _ in range(len(areas))]
    for k,v in dct.items():
        for i in v:
            res[i] = [prices[j] for j in v if i!=j]    
    return res

Output:

%timeit method_Npow2(areas, prices)
# 1 loop, best of 5: 13.9 s per loop

%timeit method_Npow1(areas, prices)
# 1 loop, best of 5: 1.76 s per loop

m1 = method_Npow2(areas, prices)
m2 = method_Npow1(areas, prices)
np.all(np.asarray(m1) == np.asarray(m2))
# True