What does this indexing mean in numpy?

I saw this code by someone and I just can't understand what it means.

I have a 4000x32 matrix that gets indexed by a tuple with 2 values. The first value is an array with values from 0 to 3999, so a 4000x1 array. The second value is a 7x4000 matrix. It looks like this:

data_padded #4000x32 matrix
>>> array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])


data.shape[0]
>>> 4003

pm_intervals_padded # 7x4000 matrix
>>> array([[ 1,  1,  4, ...,  1,  2,  1],
       [ 2,  2,  5, ...,  2,  3,  2],
       [ 3,  3,  6, ...,  3,  4,  3],
       ...,
       [ 5,  5,  8, ...,  5,  6,  5],
       [ 6,  6,  9, ...,  6,  7,  6],
       [ 7,  7, 10, ...,  7,  8,  7]], dtype=int64)



index_arrays = np.arange(data.shape[0]), pm_intervals_padded

index_arrays # it is now a tuple with a 4000x1 vector and a 7x4000 matrix
>>> (array([   0,    1,    2, ..., 4000, 4001, 4002]),
     array([[ 1,  1,  4, ...,  1,  2,  1],
        [ 2,  2,  5, ...,  2,  3,  2],
        [ 3,  3,  6, ...,  3,  4,  3],
        ...,
        [ 5,  5,  8, ...,  5,  6,  5],
        [ 6,  6,  9, ...,  6,  7,  6],
        [ 7,  7, 10, ...,  7,  8,  7]], dtype=int64))

Now the actual indexing is performed:

max_pm = data_padded[index_arrays]

max_pm # 7x4000 matrix
>>> array([[0.        , 0.        , 0.12076883, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.11525869, ..., 0.        , 0.13826102,
        0.        ],
       [0.1493025 , 0.03919184, 0.07849565, ..., 0.08812743, 0.13011599,
        0.12065721],
       ...,
       [0.1001403 , 0.14246948, 0.06306174, ..., 0.12461658, 0.10053093,
        0.22260186],
       [0.12709181, 0.11613311, 0.08537152, ..., 0.08284497, 0.05215555,
        0.09772167],
       [0.10410622, 0.08596166, 0.02676092, ..., 0.09114279, 0.07044313,
        0.05734969]])

I really don't understand how "data_padded" changed from a 4000x32 matrix to a 7x4000 matrix.

I know I should include my own attempt, but honestly I have no clue. I understand that the first element of the tuple basically says to retrieve all the rows of "data_padded", but beyond that I'm lost.

CodePudding user response：

This is numpy combining broadcasting with integer array indexing. Take this simple example:

import numpy as np

# 2x2 data matrix
data = np.array([[1,2],
                 [3,4]])

# row indices have shape (2,), column indices have shape (3,2)
index_arrays = (np.array([0,1]), np.array([[0,1],
                                           [0,0],
                                           [0,1]]))

data[index_arrays]

Numpy first broadcasts np.array([0,1]) against the second array, to their common shape (3,2), so index_arrays effectively becomes:

index_arrays = (np.array([[0,1],
                          [0,1],
                          [0,1]]), 
                np.array([[0,1],
                          [0,0],
                          [0,1]]))

Then numpy simply retrieves one element from data for each index pair (i, j), with i taken from the first array in index_arrays and j from the same position in the second. It first retrieves the element at coordinate (0,0), then (1,1), (0,0), (1,0), and so on. The result therefore has the same shape as the broadcast index arrays:

array([[1, 4],
       [1, 3],
       [1, 4]])
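
To check the pairing described above, here is a minimal sketch that makes the broadcast explicit with np.broadcast_arrays and compares a plain element-by-element lookup against the indexed result. The names rows, cols, manual and fancy are made up for this illustration.

```python
import numpy as np

data = np.array([[1, 2],
                 [3, 4]])
rows = np.array([0, 1])        # shape (2,)
cols = np.array([[0, 1],
                 [0, 0],
                 [0, 1]])      # shape (3, 2)

# Make the implicit broadcast explicit: both results have shape (3, 2).
rows_b, cols_b = np.broadcast_arrays(rows, cols)

# Look up every (i, j) pair one at a time (slow, for illustration only).
manual = np.empty(cols.shape, dtype=data.dtype)
for pos in np.ndindex(cols.shape):
    manual[pos] = data[rows_b[pos], cols_b[pos]]

# The same lookup in a single indexing expression.
fancy = data[rows, cols]
print(fancy)
```

Both manual and fancy hold the same (3, 2) array, which is the output shown above.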

CodePudding user response：

np.arange(data.shape[0])

has shape (4000,), not 1x4000.

pm_intervals_padded   # (7,4000)

These broadcast together to select (7,4000) points. For the purpose of broadcasting, (4000,) and (1,4000) behave the same. Broadcasting is a fundamental numpy behavior that applies to operations like addition as well as to this indexing.
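
A scaled-down sketch of the same shapes may help. The sizes 5, 8 and 3 stand in for 4000, 32 and 7, and the random contents and the names data_small, col_idx and row_idx are assumptions made only for this illustration.

```python
import numpy as np

n_rows, n_cols, k = 5, 8, 3          # stand-ins for 4000, 32 and 7
rng = np.random.default_rng(0)
data_small = rng.random((n_rows, n_cols))
col_idx = rng.integers(0, n_cols, size=(k, n_rows))   # plays the role of pm_intervals_padded
row_idx = np.arange(n_rows)                           # shape (5,)

# (5,) and (1, 5) broadcast against (3, 5) in the same way.
shape_1d = np.broadcast(row_idx, col_idx).shape
shape_2d = np.broadcast(row_idx[None, :], col_idx).shape

# The same rule drives ordinary arithmetic such as addition.
summed = row_idx + col_idx

# Indexing picks one element per broadcast position:
# selected[a, b] == data_small[b, col_idx[a, b]]
selected = data_small[row_idx, col_idx]
print(shape_1d, shape_2d, summed.shape, selected.shape)
```

All four shapes come out as (3, 5), the scaled-down counterpart of (7,4000).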

Indexing with the tuple (x, y) is the same as passing x and y directly, so these two are equivalent:

data_padded[(x, y)]
data_padded[x, y]

Alternatively, you could use:

data_padded[np.arange(4000)[:,None], pm_intervals_padded.T]

This indexes with a (4000,1) and (4000,7), resulting in a (4000,7) selection.
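
As a quick check, the sketch below runs the tuple form, the two-argument form, and the column-vector form on a small random array. The sizes and the names data_small and cols are hypothetical stand-ins for data_padded and pm_intervals_padded.

```python
import numpy as np

rng = np.random.default_rng(1)
data_small = rng.random((5, 8))            # stand-in for data_padded
cols = rng.integers(0, 8, size=(3, 5))     # stand-in for pm_intervals_padded

by_tuple = data_small[(np.arange(5), cols)]   # index with an explicit tuple
by_args = data_small[np.arange(5), cols]      # same thing, written as two indices

# (5, 1) broadcast with (5, 3) gives a (5, 3) selection.
by_column = data_small[np.arange(5)[:, None], cols.T]
print(by_tuple.shape, by_column.shape)
```

The first two results are identical, and the third is simply their transpose.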