I came across this code and can't work out what it does.
I have a 4000x32 matrix that is indexed with a tuple of two values. The first value is an array holding the values 0 to 3999, so a 4000x1 array. The second value is a 7x4000 matrix. It looks like this:
data_padded #4000x32 matrix
>>> array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
data.shape[0]
>>> 4003
pm_intervals_padded # 7x4000 matrix
>>> array([[ 1, 1, 4, ..., 1, 2, 1],
[ 2, 2, 5, ..., 2, 3, 2],
[ 3, 3, 6, ..., 3, 4, 3],
...,
[ 5, 5, 8, ..., 5, 6, 5],
[ 6, 6, 9, ..., 6, 7, 6],
[ 7, 7, 10, ..., 7, 8, 7]], dtype=int64)
index_arrays = np.arange(data.shape[0]), pm_intervals_padded
index_arrays # it is now a tuple with a 4000x1 vector and a 7x4000 matrix
>>> (array([ 0, 1, 2, ..., 4000, 4001, 4002]),
array([[ 1, 1, 4, ..., 1, 2, 1],
[ 2, 2, 5, ..., 2, 3, 2],
[ 3, 3, 6, ..., 3, 4, 3],
...,
[ 5, 5, 8, ..., 5, 6, 5],
[ 6, 6, 9, ..., 6, 7, 6],
[ 7, 7, 10, ..., 7, 8, 7]], dtype=int64))
Now the actual indexing is performed:
max_pm = data_padded[index_arrays]
max_pm # 7x4000 matrix
>>> array([[0. , 0. , 0.12076883, ..., 0. , 0. ,
0. ],
[0. , 0. , 0.11525869, ..., 0. , 0.13826102,
0. ],
[0.1493025 , 0.03919184, 0.07849565, ..., 0.08812743, 0.13011599,
0.12065721],
...,
[0.1001403 , 0.14246948, 0.06306174, ..., 0.12461658, 0.10053093,
0.22260186],
[0.12709181, 0.11613311, 0.08537152, ..., 0.08284497, 0.05215555,
0.09772167],
[0.10410622, 0.08596166, 0.02676092, ..., 0.09114279, 0.07044313,
0.05734969]])
I really don't understand how indexing "data_padded", a 4000x32 matrix, produces a 7x4000 matrix.
I know I should include my own attempt, but honestly I have no clue. I understand that the first element of the tuple basically says to retrieve all the rows of "data_padded", but beyond that I am lost.
CodePudding user response:
It is numpy doing broadcasting and indexing. Take this simple example:
# 2x2
data = np.array([[1,2],
[3,4]])
index_arrays = (np.array([0,1]), np.array([[0,1],
[0,0],
[0,1]]))
data[index_arrays]
Numpy will first broadcast np.array([0,1]) to match the shape of the second array, so index_arrays becomes:
index_arrays = (np.array([[0,1],
[0,1],
[0,1]]),
np.array([[0,1],
[0,0],
[0,1]]))
Then numpy simply retrieves an element from data for each index pair (i, j), with i taken from the first array in index_arrays and j from the second: first the element at coordinate (0,0), then (1,1), (0,0), (1,0), etc.
So the return would be of the same size as your index array:
array([[1, 4],
[1, 3],
[1, 4]])
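The two steps above (broadcast, then pick one element per index pair) can be reproduced by hand. This is only a sketch of the behavior, not how numpy implements it internally; the names rows, cols, rows_b, cols_b, manual and fancy are made up for the illustration.

```python
import numpy as np

data = np.array([[1, 2],
                 [3, 4]])
rows = np.array([0, 1])            # shape (2,)
cols = np.array([[0, 1],
                 [0, 0],
                 [0, 1]])          # shape (3, 2)

# Step 1: broadcast both index arrays to a common shape, here (3, 2).
rows_b, cols_b = np.broadcast_arrays(rows, cols)

# Step 2: look up one element of data per (i, j) pair.
manual = np.empty(cols_b.shape, dtype=data.dtype)
for r in range(cols_b.shape[0]):
    for c in range(cols_b.shape[1]):
        manual[r, c] = data[rows_b[r, c], cols_b[r, c]]

# The indexing expression from the answer, in one go.
fancy = data[rows, cols]

print(rows_b.shape)                    # (3, 2)
print(np.array_equal(manual, fancy))   # True
```

The result has the shape of the broadcast index arrays, not the shape of data.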
CodePudding user response:
np.arange(data.shape[0]) has shape (4000,) (not 1x4000).
pm_intervals_padded has shape (7,4000).
These broadcast together to select (7,4000) points. For the purpose of broadcasting, (4000,) and (1,4000) behave the same. Broadcasting is a fundamental numpy behavior, applicable to operations like addition as well as to this indexing.
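To see the shapes at work without the real data, here is a small stand-in. The sizes (5 rows, 12 columns, 7 picks per row) and the random contents are assumptions chosen for the sketch; only the variable names data_padded and pm_intervals_padded come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, n_picks = 5, 12, 7        # stand-ins for 4000, 32 and 7

data_padded = rng.random((n_rows, n_cols))                               # (5, 12)
pm_intervals_padded = rng.integers(0, n_cols, size=(n_picks, n_rows))    # (7, 5)

x = np.arange(n_rows)                                                    # (5,)

# (5,) and (7, 5) broadcast to (7, 5), which is the shape of the result.
print(np.broadcast(x, pm_intervals_padded).shape)    # (7, 5)
max_pm = data_padded[x, pm_intervals_padded]
print(max_pm.shape)                                  # (7, 5)

# Each entry pairs row i with the column stored at pm_intervals_padded[k, i].
k, i = 3, 2
assert max_pm[k, i] == data_padded[i, pm_intervals_padded[k, i]]

# The same broadcasting rule applies to plain arithmetic.
total = x + pm_intervals_padded
print(total.shape)                                   # (7, 5)
```

So column i of the result holds the 7 values picked from row i of data_padded.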
Indexing with the tuple (x, y) is equivalent to indexing with x and y separated by a comma, so these two lines do the same thing:
data_padded[(x, y)]
data_padded[x, y]
Alternatively, you could use
data_padded[np.arange(4000)[:,None], pm_intervals_padded.T]
This indexes with a (4000,1) and (4000,7), resulting in a (4000,7) selection.
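A quick check of both claims on a small stand-in (sizes and random contents are assumptions for the sketch, and the names n, pm, sel_tuple, sel_comma and sel_alt are illustrative): the tuple and comma forms agree, and the column-vector form returns the transposed selection.

```python
import numpy as np

rng = np.random.default_rng(1)
n, width, picks = 6, 10, 7                 # stand-ins for 4000, 32 and 7

data_padded = rng.random((n, width))
pm = rng.integers(0, width, size=(picks, n))   # plays the role of pm_intervals_padded

x = np.arange(n)

sel_tuple = data_padded[(x, pm)]           # explicit tuple, shape (7, 6)
sel_comma = data_padded[x, pm]             # comma form, shape (7, 6)
sel_alt = data_padded[x[:, None], pm.T]    # (6, 1) with (6, 7), shape (6, 7)

print(np.array_equal(sel_tuple, sel_comma))   # True
print(sel_alt.shape)                          # (6, 7)
print(np.array_equal(sel_alt, sel_comma.T))   # True
```

The column-vector form keeps one output row per row of data_padded, which is often easier to read.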