One large numpy array (3mil rows on 5 columns) - how to pick rows that meet several conditions at th

Time:03-08

        def func(data):
            A = np.zeros([len(data), 5], np.int16)
            for i in range(len(data)):
                if(data[i, 1] >= -10 and data[i, 1] <= -13 and
                   data[i, 3] >= -20 and data[i, 3] <= -22):
                    A[i] = data[i]
                    
                elif(data[i, 1] >= -16 and data[i, 1] <= -19 and
                   data[i, 3] >= -24 and data[i, 3] <= --30):
                    A[i] = data[i]
                
                .... (for another similar 8 elif conditions)
                
                else:
                    continue

            return A[~np.all(A == 0, axis=1)]
        func(data)

Problem: I have a large NumPy array and I need to extract whole rows (not just their indices or values) that meet those conditions. The code runs, but it is very slow. That wouldn't be an issue on its own, but I have to read another 800 files and then perform other tasks.

How can I optimise this function? Thank you in advance.

CodePudding user response:

My solution is very close to AJH's, but I believe it is a bit simpler, and you don't need to keep a full-size array A in memory. I'm not sure it changes much, but it is a bit less memory-intensive.

def func(data):
    condition_1 = ((data[:, 1] <= -10) & (data[:, 1] >= -13) & (data[:, 3] <= -20) & (data[:, 3] >= -22))
    condition_2 = ((data[:, 1] <= -16) & (data[:, 1] >= -19) & (data[:, 3] <= -24) & (data[:, 3] >= -30))
    mask = (condition_1 | condition_2)
    return data[mask]

Then just add all the other conditions you need in the same way. For reference, & is element-wise "and" and | is element-wise "or". I find the full keywords easier to read, but and/or do not work on NumPy arrays.
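With ten or so conditions, writing each one out by hand gets repetitive. One possible way to keep it manageable, sketched below, is to list the bounds in a table and OR the masks together in a loop. The names RANGES and select_rows are made up for this illustration, and the table only holds the two ranges visible in the question (lower bound first); the remaining eight are not shown there, so they are left out.

```python
import numpy as np

# Hypothetical table of (col1_low, col1_high, col3_low, col3_high) bounds.
# Only the two ranges visible in the question are listed here.
RANGES = [
    (-13, -10, -22, -20),
    (-19, -16, -30, -24),
]


def select_rows(data, ranges=RANGES):
    """Return the rows of data whose columns 1 and 3 fall inside any listed range."""
    col1, col3 = data[:, 1], data[:, 3]
    mask = np.zeros(len(data), dtype=bool)
    for lo1, hi1, lo3, hi3 in ranges:
        # Parentheses are required: & binds tighter than <= and >=.
        mask |= (col1 >= lo1) & (col1 <= hi1) & (col3 >= lo3) & (col3 <= hi3)
    return data[mask]


if __name__ == "__main__":
    demo = np.array(
        [
            [0, -11, 0, -21, 0],  # inside the first range
            [0, -11, 0, -25, 0],  # column 1 fits range 1, column 3 fits range 2: no match
            [0, -17, 0, -27, 0],  # inside the second range
            [0, 5, 0, 5, 0],      # outside both
        ],
        dtype=np.int16,
    )
    print(select_rows(demo))
```

Swapping & for the keyword and inside the loop would raise a ValueError, because NumPy refuses to reduce a multi-element boolean array to a single True or False.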

CodePudding user response:

I've made a working version of something you could try. It gets rid of the for loop and doesn't require you to make a whole new array A that's the same size as data; in my version, A starts out with 0 rows and is then added to as needed, which should help save space. The function is also vectorised to some extent, meaning that I'm not manually iterating through every row.

import numpy as np

def func(data):
    # The array A must have the same number of columns as the data array. If you know this beforehand, then just use that; otherwise, replace 5 with data.shape[1].
    # dtype=data.dtype keeps the result from being upcast to float when the rows are appended.
    A = np.zeros((0, 5), dtype=data.dtype)

    # mask1 contains the indices where these conditions are all met. After the mask is found, the rows in data with indices in mask1 are essentially appended to A.
    # The comparisons are copied from the question as written, so they match no rows until the bounds are corrected (see the note after the code).
    mask1 = np.where((data[:,1] >= -10) & (data[:,1] <= -13) & (data[:,3] >= -20) & (data[:,3] <= -22))
    A = np.concatenate((A, data[mask1]), axis=0)

    # Do the same for all the other conditions.
    mask2 = np.where((data[:,1] >= -16) & (data[:,1] <= -19) & (data[:,3] >= -24) & (data[:,3] <= -30))
    A = np.concatenate((A, data[mask2]), axis=0)

    # ... (repeat for all other conditions)

    return A
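One thing to be aware of with the concatenate approach: the result is grouped by condition rather than kept in the original row order, and a row that satisfies two conditions would appear twice. With the non-overlapping ranges in the question the duplicates should not occur, but the ordering can still differ from the loop version. A small sketch with made-up data and made-up, deliberately overlapping conditions shows the difference:

```python
import numpy as np

# Made-up data: 4 rows, 5 columns; column 0 is used as a row id.
data = np.array(
    [
        [0, -17, 0, -27, 0],
        [1, -11, 0, -21, 0],
        [2, 5, 0, 5, 0],
        [3, -12, 0, -22, 0],
    ],
    dtype=np.int16,
)

# Two illustrative conditions that overlap on purpose (not the question's ranges).
cond_a = (data[:, 1] >= -13) & (data[:, 1] <= -10)
cond_b = (data[:, 1] >= -20) & (data[:, 1] <= -12)

# Per-condition selection followed by concatenation: grouped by condition.
stacked = np.concatenate((data[cond_a], data[cond_b]), axis=0)

# One combined mask: original order, each row at most once.
combined = data[cond_a | cond_b]

print(stacked[:, 0])   # row ids as they come out of concatenate
print(combined[:, 0])  # row ids as they come out of the combined mask
```

Here stacked contains row 3 twice and places row 0 after rows 1 and 3, while combined returns rows 0, 1 and 3 in their original order.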

One more thing: I am a bit confused about how data[i, 1] >= -10 and data[i, 1] <= -13 can ever evaluate to True, since -13 is less than -10. The same goes for data[i, 3] >= -20 and data[i, 3] <= -22. Perhaps you flipped the signs by accident, or need to swap the >= and <= operators?

And you've got a typo (--30 instead of -30), just in case you missed that. I'm not trying to be nitpicky, I just don't want you getting stuck on why your code isn't working when you run it. I'm sorry if I'm coming off as mean, I'm not trying to be.
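Both points are easy to check in isolation. The sketch below uses a few made-up values rather than the real data:

```python
import numpy as np

values = np.array([-9, -10, -12, -13, -14], dtype=np.int16)

# As written in the question: a value cannot be >= -10 and <= -13 at once.
as_written = (values >= -10) & (values <= -13)

# With the bounds swapped, values between -13 and -10 (inclusive) match.
swapped = (values >= -13) & (values <= -10)

print(as_written.any())  # False: no value satisfies both comparisons
print(values[swapped])   # the values inside the swapped range

# Python parses --30 as -(-30), so it is positive 30 rather than a syntax error.
print(--30 == 30)        # True
```

So the original loop runs without raising anything, but the first branch can never be taken and the --30 bound silently becomes +30.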

Anyway, let me know if you need clarification on anything!

P.S. I am so sorry for the side-scrolling; I'm not sure how to fix it.
