Home > Software engineering >  Find indices in each column of a 2d array for every element in a 1d array without looping
Find indices in each column of a 2d array for every element in a 1d array without looping

Time:02-02

I'm trying to do a Monte Carlo simulation of a contest and allocate points based on standings for each iteration of the sim. I currently have a working solution using Numpy's argwhere, however for large contest sizes and simulations (e.g. 25000 contestants and 10000 simulations) the script is extremely slow due to list comprehensions.

import numpy as np 

#sample arrays, the actual sizes are arbitrary based on the number of contestants (ranks.shape[1]) and simulations (ranks.shape[0]) but len(field) is always equal to number of contestants and len(payout_array)

#points to be allocated based on finish position
points_array = np.array([100,50,0,0,0])

#standings for the first simulation = [2,4,3,1,0], the second = [4,0,1,2,3]
ranks = np.array([[2, 4, 4, 1],
 [4, 0, 0, 4],
 [3, 1, 1, 0],
 [1, 2, 3, 2],
 [0, 3, 2, 3]])

field = np.arange(5)

ind = [np.argwhere(ranks==i) for i in field]
roi = [np.sum(points_array[i[:,0]]) for i in ind]
print(roi)

returns: [100, 100, 100, 0, 300]

This solution works and is very fast for small arrays:

%timeit [np.argwhere(ranks==i) for i in field]
36.2 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

However for large contestant fields and many simulations the script hangs on the list comprehension using argwhere (still running after 20 minutes for a 30k person field and 10k simulations with no memory constraints on my machine). Is there a way to vectorize argwhere or reduce the complexity of the lookup to help speed up finding the indices in ranks for all of the elements in field?

CodePudding user response:

You can use numpy.argsort to vectorize this behavior. This works because your final array is in order of the contestants, and you need to find the index of their finish position based on the rank array.

import numpy as np

#sample arrays, the actual sizes are arbitrary based on the number of contestants (ranks.shape[1]) and simulations (ranks.shape[0]) but len(field) is always equal to number of contestants and len(payout_array)

#points to be allocated based on finish position
points_array = np.array([100,50,0,0,0])

#standings for the first simulation = [2,4,3,1,0], the second = [4,0,1,2,3]
ranks = np.array([[2, 4, 4, 1],
 [4, 0, 0, 4],
 [3, 1, 1, 0],
 [1, 2, 3, 2],
 [0, 3, 2, 3]])

roi = points_array[np.argsort(ranks, axis=0)].sum(axis=1)
print(roi)
  • Related