Is there a numpy way of looping/getting sub arrays of an array to extract info?

First of all, thank you for the time you took to answer me.

To give a small example, I have a huge dataset (n instances, 3 features) like this:

import numpy as np
data = np.array([[7.0, 2.5, 3.1], [4.3, 8.8, 6.2], [1.1, 5.5, 9.9]])

It's labeled in another array:

label = np.array([0, 1, 0])

Questions:

I know I can solve my problem with a plain Python for loop, but I'm looking for a numpy way (without a for loop) that takes less time (as fast as possible).
If there is no way to do it without a for loop, which approach would be best (M1, M2, or some other wizardry)?

My solution:

clusters = []
for lab in range(label.max() + 1):
    # M1: creating new object
    c = data[label == lab]
    clusters.append([c.min(axis=0), c.max(axis=0)])

    # M2: comparing multiple times (called views?)
    # clusters.append([data[label == lab].min(axis=0), data[label == lab].max(axis=0)])

print(clusters)
# [[array([1.1, 2.5, 3.1]), array([7. , 5.5, 9.9])], [array([4.3, 8.8, 6.2]), array([4.3, 8.8, 6.2])]]

CodePudding user response：

You could start from an easier variant of this problem:

Given arr and its label, can you find the minimum and maximum values of the arr items in each label group?

For instance:

arr = np.array([55,  7, 49, 65, 46, 75,  4, 54, 43, 54])
label = np.array([1, 3, 2, 0, 0, 2, 1, 1, 1, 2])

Then you would expect the minimum and maximum values of arr in each label group to be:

min_values = np.array([46,  4, 49,  7])
max_values = np.array([65, 55, 75,  7])
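These expected values can be checked with a plain loop over the labels. The following is only a minimal reference sketch (the same idea as the loop in the question, with np.unique added here to enumerate the labels); it is not the vectorized approach:

```python
import numpy as np

arr = np.array([55, 7, 49, 65, 46, 75, 4, 54, 43, 54])
label = np.array([1, 3, 2, 0, 0, 2, 1, 1, 1, 2])

# Reference result: one boolean mask per distinct label (slow for many labels)
groups = np.unique(label)
min_values = np.array([arr[label == g].min() for g in groups])
max_values = np.array([arr[label == g].max() for g in groups])

print(min_values)
print(max_values)
```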

Here is a numpy approach to this kind of problem:

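def groupby_minmax(arr, label, return_groups=False):
    # Sort by label so that items of the same group become contiguous
    arg_idx = np.argsort(label)
    arr_sort = arr[arg_idx]
    label_sort = label[arg_idx]
    # Start index of each group: 0, plus every position where the sorted label changes
    div_points = np.r_[0, np.flatnonzero(np.diff(label_sort)) + 1]
    # Reduce each slice arr_sort[div_points[i]:div_points[i + 1]]
    min_values = np.minimum.reduceat(arr_sort, div_points)
    max_values = np.maximum.reduceat(arr_sort, div_points)
    if return_groups:
        return min_values, max_values, label_sort[div_points]
    else:
        return min_values, max_values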
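To see what each step does, here is a sketch that traces the intermediate arrays for the example arr and label above. The kind="stable" argument is an addition made here only so that the sorted order shown in the comments is reproducible; the min/max results do not depend on it.

```python
import numpy as np

arr = np.array([55, 7, 49, 65, 46, 75, 4, 54, 43, 54])
label = np.array([1, 3, 2, 0, 0, 2, 1, 1, 1, 2])

# Sort by label so each group is contiguous (kind="stable" is an assumption
# added for a reproducible trace, not part of the function above)
arg_idx = np.argsort(label, kind="stable")
arr_sort = arr[arg_idx]        # [65 46 55  4 54 43 49 75 54  7]
label_sort = label[arg_idx]    # [0 0 1 1 1 1 2 2 2 3]

# Group start positions: 0, plus each index right after a label change
div_points = np.r_[0, np.flatnonzero(np.diff(label_sort)) + 1]  # [0 2 6 9]

# reduceat reduces arr_sort[0:2], arr_sort[2:6], arr_sort[6:9], arr_sort[9:]
min_values = np.minimum.reduceat(arr_sort, div_points)  # [46  4 49  7]
max_values = np.maximum.reduceat(arr_sort, div_points)  # [65 55 75  7]
```

The labels do not need to be consecutive integers for this to work: label_sort[div_points] gives the label that each output row belongs to.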

Now there's not much to change in order to adapt it to your use case:

def groupby_minmax_OP(arr, label, return_groups=False):
    arg_idx = np.argsort(label)
    arr_sort = arr[arg_idx]
    label_sort = label[arg_idx]
    div_points = np.r_[0, np.flatnonzero(np.diff(label_sort)) + 1]
    # axis=0 reduces over rows, so min/max are taken per feature column
    min_values = np.minimum.reduceat(arr_sort, div_points, axis=0)
    max_values = np.maximum.reduceat(arr_sort, div_points, axis=0)
    if return_groups:
        return min_values, max_values, label_sort[div_points]
    else:
        # Shape (n_groups, 2, n_features): [min, max] for each group
        return np.array([min_values, max_values]).swapaxes(0, 1)

groupby_minmax_OP(data, label)

Output:

array([[[1.1, 2.5, 3.1],
        [7. , 5.5, 9.9]],

       [[4.3, 8.8, 6.2],
        [4.3, 8.8, 6.2]]])
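As a possible alternative that skips the sort, the same result can be sketched with the unbuffered ufunc.at method. This is a different technique from the reduceat approach above, not part of it. The function name minmax_by_label_at is hypothetical, and the sketch assumes the labels are non-negative integers in 0..k-1, as in the question. Whether it is faster than the sort-based version depends on the NumPy version and the data, so it is worth timing both.

```python
import numpy as np

def minmax_by_label_at(data, label):
    """Per-label [min, max] rows; assumes integer labels in 0..k-1 (hypothetical helper)."""
    n_groups = label.max() + 1
    n_features = data.shape[1]
    # Start from +inf / -inf so that any real value replaces the initial one
    mins = np.full((n_groups, n_features), np.inf)
    maxs = np.full((n_groups, n_features), -np.inf)
    # Unbuffered in-place update: mins[label[i]] = minimum(mins[label[i]], data[i])
    np.minimum.at(mins, label, data)
    np.maximum.at(maxs, label, data)
    # Shape (n_groups, 2, n_features), matching the layout of the output above
    return np.stack([mins, maxs], axis=1)

data = np.array([[7.0, 2.5, 3.1], [4.3, 8.8, 6.2], [1.1, 5.5, 9.9]])
label = np.array([0, 1, 0])
result = minmax_by_label_at(data, label)
print(result)
```

A label value that never occurs would keep its initial inf / -inf row in this sketch, so such rows need to be filtered out if the labels have gaps.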

CodePudding user response：

This has already been answered; see the existing question "python numpy access list of arrays without for loop".