Main goal: Suppose I have a multi-dimensional array, along with a 0-1 index set containing one entry per element. For example, if my array is [[3,6,7,8], [1,32,45,7]]
, my index set might be [[1,0,1,1], [0,0,1,1]]
. I would like to make n
copies of each row of my array, then randomly increase each element whose corresponding index is equal to 1.
import time
import random
import numpy as np

def foo(arr, upper_bound, index_set, first_set_size, sec_set_size, limit):
    my_array = np.zeros((first_set_size * sec_set_size, limit))  # each row is copied |sec_set_size| times
    it = 0
    for i in range(first_set_size):
        for j in range(sec_set_size):
            my_array[it] = arr[i]  # copy the elements from the corresponding row
            for k in range(limit):
                if index_set[i][k] == 1:  # update the elements whose indices are one
                    temp = arr[i][k]  # get the current value
                    my_array[it][k] = temp + random.randint(1, upper_bound - temp)  # I use fastrand.pcg32bounded here. Update the value.
            it += 1
    return my_array
upper_bound = 50
limit = 1000
first_set_size= 100
sec_set_size = 50
arr = np.random.randint(25, size=(first_set_size, limit)) #create an array containing integer numbers
index_set = np.array([[random.randint(0, 1) for j in range(limit)] for i in range(first_set_size)]) #each element has an index which is either 1 or 0
start_time = time.time() #measure the time taken by the function
result = foo(arr, upper_bound,index_set, first_set_size, sec_set_size, limit)
print("time taken: %s " % (time.time() - start_time))
Once I increase the limit and the set sizes, the code takes several minutes. Is there any way I can perform this operation faster and more efficiently? I've spent quite a bit of time on this but could not improve the speed of my implementation.
EDIT: Suppose my initial array is:
[[11 23 24 17 0]
[ 1 23 12 19 5]
[20 15 1 17 17]
[ 3 8 7 0 24]]
Also, my index set is given as;
[[1 0 0 0 1]
[1 0 1 0 0]
[1 1 1 1 0]
[0 1 0 1 1]]
If sec_set_size=5
, I would like to take 5 copies of each row and increase the values of the elements whose indices are 1.
The final result should be like this;
[[39. 23. 24. 17. 44.]
[50. 23. 24. 17. 27.]
[42. 23. 24. 17. 24.]
[45. 23. 24. 17. 11.]
[49. 23. 24. 17. 43.]
[23. 23. 44. 19. 5.]
[10. 23. 37. 19. 5.]
[14. 23. 29. 19. 5.]
[12. 23. 22. 19. 5.]
[ 5. 23. 15. 19. 5.]
[36. 45. 26. 37. 17.]
[24. 40. 35. 38. 17.]
[34. 20. 24. 31. 17.]
[27. 16. 9. 20. 17.]
[37. 37. 6. 37. 17.]
[ 3. 50. 7. 46. 47.]
[ 3. 13. 7. 37. 44.]
[ 3. 23. 7. 32. 29.]
[ 3. 10. 7. 22. 41.]
[ 3. 22. 7. 32. 41.]]
CodePudding user response:
Numpy is all about vectorization. If you're using python loops, you're probably doing it wrong.
First off, all the random number generators are vectorized, so the entire boolean mask can be generated in one call:
index_set = np.random.randint(2, size=(first_set_size, limit), dtype=bool)
(You already do exactly this to generate arr itself.)
Next, to copy rows multiple times, you can use np.repeat
:
my_array = np.repeat(arr, sec_set_size, axis=0)
Notice that you don't need first_set_size
at all. It's redundant with arr.shape[0]
. You can do the same with your boolean mask to make the shapes match:
index_set = np.repeat(index_set, sec_set_size, axis=0)
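For example, a small sketch on the 2-row array from the question, just to show what axis=0 repetition looks like:

```python
import numpy as np

arr = np.array([[3, 6, 7, 8], [1, 32, 45, 7]])
# axis=0 repeats whole rows: all copies of row 0 come first, then all copies of row 1.
out = np.repeat(arr, 3, axis=0)
print(out)  # 6 rows: three copies of [3 6 7 8] followed by three copies of [1 32 45 7]
```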
Now you can increase the portion of my_array
masked by index_set
with an appropriate number of randomly generated elements. Note the in-place +=, which matches temp + random.randint(1, upper_bound - temp) from the original loop (np.random.randint accepts an array for the high bound, so each element gets its own range):
my_array[index_set] += np.random.randint(1, upper_bound - my_array[index_set])
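A minimal, self-contained sketch of this masked update on the small example from the question (the fixed seed and the newer Generator API are my additions, purely for reproducibility of the illustration):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded Generator, assumed only for reproducibility
a = np.array([[3, 6, 7, 8], [1, 32, 45, 7]])
mask = np.array([[1, 0, 1, 1], [0, 0, 1, 1]], dtype=bool)
upper_bound = 50
# One bounded random increment per masked element; the high bound varies per element.
a[mask] += rng.integers(1, upper_bound - a[mask])
print(a)  # masked entries strictly increased, unmasked entries untouched
```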
Your entire program reduces to about four (very fast) lines, plus some initialization:
def foo(arr, upper_bound, index_set, sec_set_size, limit):
    my_array = np.repeat(arr, sec_set_size, axis=0)
    index_set = np.repeat(index_set, sec_set_size, axis=0)
    my_array[index_set] += np.random.randint(1, upper_bound - my_array[index_set])
    return my_array
upper_bound = 50
limit = 1000
first_set_size= 100
sec_set_size = 50
arr = np.random.randint(25, size=(first_set_size, limit)) #create an array containing integer numbers
index_set = np.random.randint(2, size=(first_set_size, limit), dtype=bool)
start_time = time.time() #measure the time taken by the function
result = foo(arr, upper_bound, index_set, sec_set_size, limit)
print(f"time taken: {time.time() - start_time}")
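As a quick sanity check, here is a self-contained sketch; the foo below is a hypothetical variant of the vectorized version that also returns the repeated mask, so the result can be compared column-by-column against the specification:

```python
import numpy as np

def foo(arr, upper_bound, index_set, sec_set_size):
    # Hypothetical variant: returns the repeated mask too, for verification.
    my_array = np.repeat(arr, sec_set_size, axis=0)
    mask = np.repeat(index_set, sec_set_size, axis=0)
    my_array[mask] += np.random.randint(1, upper_bound - my_array[mask])
    return my_array, mask

arr = np.random.randint(25, size=(10, 20))
index_set = np.random.randint(2, size=(10, 20), dtype=bool)
out, mask = foo(arr, 50, index_set, 5)

repeated = np.repeat(arr, 5, axis=0)
assert np.array_equal(out[~mask], repeated[~mask])  # untouched where index is 0
assert np.all(out[mask] > repeated[mask])           # strictly increased where index is 1
assert np.all(out[mask] < 50)                       # stays below upper_bound
print("all checks passed")
```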
You may want to experiment with using integer indices instead of a boolean mask. Precomputed indices can make the fancy indexing more efficient, since the non-zero positions don't have to be recomputed for each of the two masked accesses, but on the other hand the setup is a bit more expensive:
def foo(arr, upper_bound, index_set, sec_set_size, limit):
    my_array = np.repeat(arr, sec_set_size, axis=0)
    r, c = np.where(index_set)
    r = (sec_set_size * r[:, None] + np.arange(sec_set_size)).ravel()
    c = np.repeat(c, sec_set_size)
    my_array[r, c] += np.random.randint(1, upper_bound - my_array[r, c])
    return my_array
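To see which variant wins on your sizes, a rough timing sketch is below (both variants use the in-place += form; absolute numbers will vary by machine and NumPy version):

```python
import time
import numpy as np

def foo_mask(arr, upper_bound, index_set, sec_set_size):
    # Boolean-mask variant.
    my_array = np.repeat(arr, sec_set_size, axis=0)
    mask = np.repeat(index_set, sec_set_size, axis=0)
    my_array[mask] += np.random.randint(1, upper_bound - my_array[mask])
    return my_array

def foo_idx(arr, upper_bound, index_set, sec_set_size):
    # Integer-index variant: compute (row, col) positions once.
    my_array = np.repeat(arr, sec_set_size, axis=0)
    r, c = np.where(index_set)
    r = (sec_set_size * r[:, None] + np.arange(sec_set_size)).ravel()
    c = np.repeat(c, sec_set_size)
    my_array[r, c] += np.random.randint(1, upper_bound - my_array[r, c])
    return my_array

arr = np.random.randint(25, size=(100, 1000))
index_set = np.random.randint(2, size=(100, 1000), dtype=bool)
for f in (foo_mask, foo_idx):
    t0 = time.time()
    f(arr, 50, index_set, 50)
    print(f.__name__, time.time() - t0)
```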