I have a boolean array of shape (n_samples, n_items) which represents a set: my_set[i, j] tells whether sample i contains item j.
To populate it, the array is initialized with zeros, and I receive another array of integers, with shape (n_samples, 3), giving three items that belong to each sample. For instance:
my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1,3,4], [0,1,2]], dtype=np.int64)
So I need to fill my_set with ones in row 0, columns 1, 3, 4, and in row 1, columns 0, 1, 2.
init_values contains valid values in the appropriate range (that is, in [0, n_items)), and no row contains duplicate items.
Some failed approaches:
- I know that a list (or array) of integers can be used as an index, so I tried to use init_values directly as the index, but it failed:
my_set[init_values] = 1
File "<ipython-input-9-9b2c4d19f4f6>", line 1, in <cell line: 1>
my_set[init_values] = 1
IndexError: index 3 is out of bounds for axis 0 with size 2
- I don't know why the 3 is indexing over the first axis, so I tried a second approach: "pick all rows and index only the desired columns", using a mix of slicing and integer indexing. It didn't raise an error, but it didn't work as expected either. Check the shape: I expected it to be (2, 3), however...
my_set[:, init_values].shape
Out[11]: (2, 2, 3)
- Not sure why it didn't work, but at least the first axis looks correct, so I tried to pick only the first column, which is a list of integers and therefore "more natural"... once again, it didn't work:
my_set[:, init_values[:,0]].shape
Out[12]: (2, 2)
I expected this shape to be (2, 1), since I wanted all rows with a single column each, corresponding to the indices given in init_values.
- I decided to go back to the integer index approach for the first axis... and it worked:
my_set[np.arange(len(my_set)), init_values[:,0]].shape
Out[13]: (2,)
However, it only works for one column, so I need to iterate over the columns to make it really work, but it looks like a good initial workaround.
Current solution
So, to solve my original problem, I wrote this:
for c in range(init_values.shape[1]):
    my_set[np.arange(len(my_set)), init_values[:, c]] = 1
# now let's check that my_set is properly filled
print(my_set)
Out[14]: [[False True False True True]
[ True True True False False]]
which is exactly what I need.
Question(s):
That said, here goes my main question:
Is there a more efficient way to do this? The loop looks quite inefficient as the number of elements grows (for this example I used 3, but I actually need larger values).
In addition to this, I'd like to understand why using np.arange on the first index behaves differently from slicing it with a bare colon (:). I didn't expect this behavior.
Any other comments that help explain why the previous approaches failed are also welcome.
Answer:
You only have column indices, so you also need to create their corresponding row indices:
>>> my_set[np.arange(len(my_set))[:, None], init_values] = 1
>>> my_set
array([[False, True, False, True, True],
[ True, True, True, False, False]])
[:, None] converts the row indices from a one-dimensional array of shape (n_samples,) into a column vector of shape (n_samples, 1), so that the row and column indices have compatible shapes for broadcasting:
>>> np.arange(len(my_set))[:, None]
array([[0],
[1]])
>>> np.broadcast_arrays(np.arange(len(my_set))[:, None], init_values)
[array([[0, 0, 0],
[1, 1, 1]]),
array([[1, 3, 4],
[0, 1, 2]], dtype=int64)]
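As a quick sanity check, the broadcast assignment can be compared against the loop from the question. The sketch below also tries np.put_along_axis, which is an alternative not mentioned above and is included only as an assumption about what may read more clearly; it requires NumPy 1.15 or later. The names expected, by_broadcast and by_put are illustrative.

```python
import numpy as np

init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)
n_samples, n_items = 2, 5

# Reference result: the column-by-column loop from the question.
expected = np.zeros((n_samples, n_items), dtype=bool)
for c in range(init_values.shape[1]):
    expected[np.arange(n_samples), init_values[:, c]] = True

# Row indices of shape (n_samples, 1) broadcast against the column indices.
by_broadcast = np.zeros((n_samples, n_items), dtype=bool)
by_broadcast[np.arange(n_samples)[:, None], init_values] = True

# Alternative (an assumption, not part of the answer above):
# put_along_axis pairs each row with its own column indices internally.
by_put = np.zeros((n_samples, n_items), dtype=bool)
np.put_along_axis(by_put, init_values, True, axis=1)

print(np.array_equal(expected, by_broadcast))
print(np.array_equal(expected, by_put))
```

All three arrays should hold the same values, and neither vectorized version loops in Python over the columns of init_values.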
When a slice is combined with an integer array index, the array index is applied to every position that the slice selects along its own axis. Here is a simple test. The matrix to be indexed is as follows:
>>> ar = np.arange(4).reshape(2, 2)
>>> ar
array([[0, 1],
[2, 3]])
If you want to get the elements with indices 0 and 1 in row 0, and the elements with indices 1 and 0 in row 1, but you combine the column indices [[0, 1], [1, 0]] with a slice, you will get:
>>> ar[:, [[0, 1], [1, 0]]]
array([[[0, 1],
[1, 0]],
[[2, 3],
[3, 2]]])
This is equivalent to combining each row index, 0 and then 1, with the full set of column indices:
>>> ar[0, [[0, 1], [1, 0]]]
array([[0, 1],
[1, 0]])
>>> ar[1, [[0, 1], [1, 0]]]
array([[2, 3],
[3, 2]])
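The equivalence between the slice and the per-row lookups can be checked directly. This is a minimal sketch; the names cols, sliced and stacked are illustrative and not from the original post.

```python
import numpy as np

ar = np.arange(4).reshape(2, 2)
cols = np.array([[0, 1], [1, 0]])

# Slice on axis 0, index array on axis 1.
sliced = ar[:, cols]

# The same lookup done one row at a time, then stacked along a new first axis.
stacked = np.stack([ar[r, cols] for r in range(ar.shape[0])])

# The result shape is the slice length followed by the shape of the index array.
print(sliced.shape)
print(np.array_equal(sliced, stacked))
```

This is why indexing with a slice and a two-dimensional index array produces a three-dimensional result rather than one value per (row, column) pair.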
In fact, broadcasting is applied implicitly here: the scalar row index is broadcast against the column indices. The actual indices are:
>>> np.broadcast_arrays(0, [[0, 1], [1, 0]])
[array([[0, 0],
[0, 0]]),
array([[0, 1],
[1, 0]])]
>>> np.broadcast_arrays(1, [[0, 1], [1, 0]])
[array([[1, 1],
[1, 1]]),
array([[0, 1],
[1, 0]])]
This is not the same as the indices you actually need. Therefore, you need to manually generate the correct row indices for broadcasting:
>>> ar[[[0], [1]], [[0, 1], [1, 0]]]
array([[0, 1],
[3, 2]])
>>> np.broadcast_arrays([[0], [1]], [[0, 1], [1, 0]])
[array([[0, 0],
[1, 1]]),
array([[0, 1],
[1, 0]])]
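Applied to the arrays from the question, the same rules account for each shape that was observed. The sketch below only inspects shapes and does not modify my_set; the variable rows is an illustrative name.

```python
import numpy as np

my_set = np.zeros((2, 5), dtype=bool)
init_values = np.array([[1, 3, 4], [0, 1, 2]], dtype=np.int64)

# A lone index array is applied to axis 0, so the value 3 is read as a row number.
try:
    my_set[init_values]
except IndexError as exc:
    print("IndexError:", exc)

# Slice + index array: every row is combined with every entry of the index array.
print(my_set[:, init_values].shape)
print(my_set[:, init_values[:, 0]].shape)

# Two index arrays: they are broadcast together and paired element by element.
rows = np.arange(len(my_set))
print(my_set[rows, init_values[:, 0]].shape)
print(my_set[rows[:, None], init_values].shape)
```

The last line gives the (2, 3) shape that was expected in the second attempt, because each row index is paired only with that row's own three column indices.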