Slicing a huuuge 2D numpy ndarray - How to do this efficiently?

Time:01-06

I have a numpy array of size (24, 131000). The first of these 24 columns contains an index corresponding to a number in the range [0, 25], for each of the 131000 rows. I want to slice this array to generate 26 separate arrays, each containing the data corresponding to a single index.

I've tried this as follows:

for index in huge_array[0]:
    new_array = huge_array[:, huge_array[0] == index]
    np.save(...) # saves each of the 26 new arrays separately

The problem is that this takes literally hours to finish. Does anyone know a better, more efficient way to slice this array?

CodePudding user response:

You could try using np.argwhere, for example:

import numpy as np

# generate a "huge" array to use for my purposes
x = np.random.randn(24, 131000)

# set first row to contain indices between 0 and 25
x[0] = np.random.choice(26, size=len(x[0]))

# get unique values in first row (assuming you don't know they'll be between 0 and 25)
values = np.unique(x[0])

# loop over unique values
for value in values:
    # get the column indices where the first row equals "value"
    idxs = np.argwhere(x[0] == value).flatten()

    # output the required array for this value (if you want to keep the
    # 2D shape rather than a 1D array, remove the ".flatten()")
    # save each value to its own file so earlier results aren't overwritten
    np.save(f"myfile_{int(value)}.npy", x[1:, idxs].flatten())

This should be very quick. Without the saving part (i.e., just building the new arrays), it runs in under a second on my machine.
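As a quick sanity check (a sketch assuming the per-value filenames used above), you can load the saved pieces back and confirm that together they contain every element of x[1:]:

import numpy as np

# the 26 flattened pieces should add up to 23 rows * 131000 columns
total = sum(np.load(f"myfile_{value}.npy").size for value in range(26))
print(total == 23 * 131000)  # expected: True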

CodePudding user response:

When you do

for index in huge_array[0]:

you're iterating over the whole first row of the data. A shape of (24, 131000) doesn't mean you have 24 columns. It means you have 24 rows and 131000 columns.
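A tiny illustration of that shape convention:

import numpy as np

a = np.zeros((24, 131000))
print(a.shape)     # (24, 131000): 24 rows, 131000 columns
print(a[0].shape)  # (131000,): a[0] is the entire first row, not a column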

This loop executes 131,000 iterations when it only needs 26, recomputing the boolean mask and re-saving the same 26 arrays over and over. You shouldn't be looping over huge_array[0]; you should be looping over range(26), the integers from 0 to 25 inclusive:

for index in range(26):
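Putting the two pieces together, a minimal sketch of the corrected loop (the filenames and the source of huge_array are hypothetical):

import numpy as np

# hypothetical: however you obtain the (24, 131000) array
huge_array = np.load("huge_array.npy")

for index in range(26):
    # mask the columns whose first-row value equals this index
    sub_array = huge_array[:, huge_array[0] == index]
    # save each of the 26 sub-arrays to its own file so nothing is overwritten
    np.save(f"huge_array_index_{index}.npy", sub_array)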