Using integers as an index for a multidimensional NumPy array

I have a boolean array of shape (n_samples, n_items) which represents a set: my_set[i, j] tells whether sample i contains item j.

To populate it, the array is initialized with zeros and then receives another array of integers, with shape (n_samples, 3), giving for each sample three items that belong to it. For instance:

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1,3,4], [0,1,2]], dtype=np.int64)

So, I need to fill my_set with ones in row 0, columns 1, 3, 4, and in row 1, columns 0, 1, 2.

init_values contains valid values in the appropriate range (that is, in [0, n_items)), and no column contains duplicated items.

Some failed approaches:

I know that a list (or array) of integers can be used as an index, so I tried using init_values directly as the index, but it failed:

my_set[init_values] = 1
  File "<ipython-input-9-9b2c4d19f4f6>", line 1, in <cell line: 1>
    my_set[init_values] = 1
IndexError: index 3 is out of bounds for axis 0 with size 2

I don't know why the 3 is indexing over the first axis, so I tried a second approach: "pick all rows and index only the desired columns", mixing slicing and integer indexing. It didn't raise an error, but it didn't work as expected either. Check out the shape: I expected it to be (2, 3), however...

my_set[:, init_values].shape
Out[11]: (2, 2, 3)

Not sure why it didn't work, but at least the first axis looks correct, so I tried picking only the first column, which is a plain list of integers and therefore "more natural"... once again, it didn't work:

my_set[:, init_values[:,0]].shape
Out[12]: (2, 2)

I expected this shape to be (2, 1) since I wanted all rows with a single column on each, corresponding to the indexes given in init_values.

I decided to go back to the integer index approach for the first axis... and it worked:

my_set[np.arange(len(my_set)), init_values[:,0]].shape
Out[13]: (2,)

However, it only works for one column, so I need to iterate over the columns to make it really work, but it looks like a good initial workaround.

Current solution

So, to solve my original problem, I wrote this:

for c in range(init_values.shape[1]):
    my_set[np.arange(len(my_set)), init_values[:,c]] = 1

# now let's check that my_set is properly filled
print(my_set)
Out[14]: [[False  True False  True  True]
          [ True  True  True False False]]

which is exactly what I need.

Question(s):

That said, here goes my main question:

Is there a more efficient way to do this? The loop seems quite inefficient as the number of elements per sample grows (for this example I used 3, but I actually need larger values).

In addition, I'd like to understand why using np.arange on the first index behaves differently from slicing it with ':'. I didn't expect this behavior.

Any other comments that help explain why the previous approaches failed are also welcome.

Answer:

You only have column indices, so you also need to create their corresponding row indices:

>>> my_set[np.arange(len(my_set))[:, None], init_values] = 1
>>> my_set
array([[False,  True, False,  True,  True],
       [ True,  True,  True, False, False]])

[:, None] converts the 1-D array of row indices into a column vector of shape (n_samples, 1), so that the row and column indices have compatible shapes for broadcasting:

>>> np.arange(len(my_set))[:, None]
array([[0],
       [1]])
>>> np.broadcast_arrays(np.arange(len(my_set))[:, None], init_values)
[array([[0, 0, 0],
        [1, 1, 1]]),
 array([[1, 3, 4],
        [0, 1, 2]], dtype=int64)]
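As a quick sanity check, the broadcast assignment can be compared against the column-by-column loop from the question. The sketch below also tries np.put_along_axis, which is a different technique from the one in this answer (it builds the matching row indices internally); the names expected, rows and alt are only illustrative.

```python
import numpy as np

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)

# Reference result: the column-by-column loop from the question.
expected = np.zeros_like(my_set)
for c in range(init_values.shape[1]):
    expected[np.arange(len(expected)), init_values[:, c]] = True

# Row indices of shape (n_samples, 1) broadcast against
# column indices of shape (n_samples, 3).
rows = np.arange(len(my_set))[:, None]
my_set[rows, init_values] = True

# Alternative (not from the answer): put_along_axis writes the value
# at the given column indices of every row.
alt = np.zeros_like(my_set)
np.put_along_axis(alt, init_values, True, axis=1)

print(np.array_equal(my_set, expected), np.array_equal(alt, expected))
```

Both forms avoid the Python-level loop over columns, which is presumably what matters once each sample has many items.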

A slice applies the indices given for the other dimensions to every position in the sliced range of its own dimension. Here is a simple test. The matrix to be indexed is as follows:

>>> ar = np.arange(4).reshape(2, 2)
>>> ar
array([[0, 1],
       [2, 3]])

If you want the elements with indices 0 and 1 in row 0, and the elements with indices 1 and 0 in row 1, but you combine the column indices [[0, 1], [1, 0]] with a slice, you will get:

>>> ar[:, [[0, 1], [1, 0]]]
array([[[0, 1],
        [1, 0]],

       [[2, 3],
        [3, 2]]])

This is equivalent to combining each row index in turn (0, then 1) with the column indices:

>>> ar[0, [[0, 1], [1, 0]]]
array([[0, 1],
       [1, 0]])
>>> ar[1, [[0, 1], [1, 0]]]
array([[2, 3],
       [3, 2]])

In fact, broadcasting is applied implicitly here. The actual indices are:

>>> np.broadcast_arrays(0, [[0, 1], [1, 0]])
[array([[0, 0],
        [0, 0]]),
 array([[0, 1],
        [1, 0]])]
>>> np.broadcast_arrays(1, [[0, 1], [1, 0]])
[array([[1, 1],
        [1, 1]]),
 array([[0, 1],
        [1, 0]])]
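To make the slice behavior concrete, here is a small sketch that rebuilds ar[:, cols] in two ways: by stacking the per-row results shown above, and by spelling out row indices of shape (2, 1, 1) that broadcast against the (2, 2) column indices. The names cols, stacked and explicit are illustrative only.

```python
import numpy as np

ar = np.arange(4).reshape(2, 2)
cols = np.array([[0, 1], [1, 0]])

# Slice on axis 0 combined with an index array on axis 1.
sliced = ar[:, cols]

# Same thing, one row index at a time, stacked along a new first axis.
stacked = np.stack([ar[r, cols] for r in range(ar.shape[0])])

# Same thing with explicit row indices: shape (2, 1, 1) broadcasts
# against cols of shape (2, 2), giving a (2, 2, 2) result.
rows = np.arange(ar.shape[0])[:, None, None]
explicit = ar[rows, cols]

print(sliced.shape)
print(np.array_equal(sliced, stacked), np.array_equal(sliced, explicit))
```

The slice therefore adds its own axis in front of the shape of the column indices, which is why the question saw (2, 2, 3) rather than (2, 3).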

These are not the indices you actually need. Therefore, you need to generate the correct row indices manually for broadcasting:

>>> ar[[[0], [1]], [[0, 1], [1, 0]]]
array([[0, 1],
       [3, 2]])
>>> np.broadcast_arrays([[0], [1]], [[0, 1], [1, 0]])
[array([[0, 0],
        [1, 1]]),
 array([[0, 1],
        [1, 0]])]
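Applied back to the arrays from the question, the same rules account for each attempt. The sketch below only records what happens for each indexing form; in_range is a made-up index array used to show what my_set[init_values] would return if its values were valid row numbers.

```python
import numpy as np

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)
rows = np.arange(len(my_set))

# A single index array indexes axis 0 only, so the value 3 is read
# as a row number and is out of bounds for 2 rows.
try:
    my_set[init_values]
    raised = False
except IndexError:
    raised = True

# Hypothetical in-range row numbers: the same form returns whole rows.
in_range = np.array([[0, 1, 1], [1, 0, 0]])

shapes = {
    "my_set[in_range]": my_set[in_range].shape,
    "my_set[:, init_values]": my_set[:, init_values].shape,
    "my_set[:, init_values[:, 0]]": my_set[:, init_values[:, 0]].shape,
    "my_set[rows, init_values[:, 0]]": my_set[rows, init_values[:, 0]].shape,
    "my_set[rows[:, None], init_values]": my_set[rows[:, None], init_values].shape,
}

print("IndexError raised:", raised)
for expr, shape in shapes.items():
    print(expr, shape)
```

Only the last two forms pair each row with its own column indices; the slice forms pair every row with every index.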