I have two arrays, e.g. one is labels another is distances:
labels= array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 2, 2, 3, 3, 3, 2, 3,
0, 3, 3, 2, 3, 2, 3, 2,...])
distances = array([2.32284095, 0.36254613, 0.95734965, 0.35429638, 2.79098656,
5.45921793, 2.63795657, 1.34516461, 1.34028463, 1.10808795,
1.60549826, 1.42531201, 1.16280383, 1.22517273, 4.48511033,
0.71543217, 0.98840598,...])
What I want to do is to group the values from distances into N arrays based on the amount of unique label values (in this case N=4). So all values with label = 3 go in one array with label = 2 in another and so on.
I can think of simple brute force with loops and if-conditions but this will incur serious slowdown on large arrays. I feel that there are better ways of doing this by using either native list comprehension or numpy, or something else, just not sure what. What would be best, most efficient approaches?
"Brute force" example for reference, note:(len(labels)==len(distances))
:
all_distance_arrays = []
for id in np.unique(labels):
    sorted_distances = []
    for index in range(len(labels)):
        if id == labels[index]:
            sorted_distances.append(distances[index])
    all_distance_arrays.append(sorted_distances)
CodePudding user response:
A simple list comprehension will be nice and fast:
groups = [distances[labels == i] for i in np.unique(labels)]
Output:
>>> groups
[array([0.95734965]),
array([0.36254613, 0.35429638, 1.34028463, 1.10808795, 1.42531201,
1.22517273]),
array([5.45921793, 1.34516461, 1.16280383, 0.71543217, 0.98840598]),
array([2.32284095, 2.79098656, 2.63795657, 1.60549826, 4.48511033])]
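A runnable sketch of the same idea, using abbreviated sample data (the full arrays in the question are truncated, so these values are just the visible prefix). A dict comprehension keeps the label-to-group mapping explicit instead of relying on the order of np.unique:

```python
import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2, 3, 2, 1, 1])
distances = np.array([2.3, 0.36, 0.96, 0.35, 2.79, 5.46, 2.64, 1.35, 1.34, 1.11])

# One boolean mask per unique label; each key maps to that label's distances
groups = {i: distances[labels == i] for i in np.unique(labels)}
```

Note the cost: the mask `labels == i` scans the whole array once per unique label, so this is O(N * K) for K unique labels; that is fine when K is small.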
CodePudding user response:
By using just NumPy as:
_, counts = np.unique(labels, return_counts=True)  # counts holds the number of occurrences of each unique label
sor = labels.argsort()
sections = np.cumsum(counts) # end index of slices
labels_sor = np.split(labels[sor], sections)[:-1]
distances_sor = np.split(distances[sor], sections)[:-1]
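Putting the snippet above into a self-contained sketch on abbreviated sample data (values taken from the visible prefix of the question's arrays). Note that argsort's default quicksort is not stable, so the order of equal-label distances within a group is not guaranteed:

```python
import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2])
distances = np.array([2.32, 0.36, 0.96, 0.35, 2.79, 5.46])

uniq, counts = np.unique(labels, return_counts=True)
order = labels.argsort()
sections = np.cumsum(counts)                   # end index of each slice
groups = np.split(distances[order], sections)[:-1]
# groups[k] holds the distances whose label is uniq[k]
```

This does a single O(N log N) sort plus one pass to split, instead of one mask scan per label.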
CodePudding user response:
"Brute force" seems likely to be adequate with a reasonable number of labels:
from collections import defaultdict
dist_group = defaultdict(list)
for lb, ds in zip(labels, distances):
    dist_group[lb].append(ds)
It's hard to see why this wouldn't fit your purposes.
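If downstream code expects ndarrays rather than Python lists, the defaultdict result converts in one extra pass. A sketch on abbreviated sample data (values from the visible prefix of the question's arrays):

```python
import numpy as np
from collections import defaultdict

labels = np.array([3, 1, 0, 1, 3, 2])
distances = np.array([2.32, 0.36, 0.96, 0.35, 2.79, 5.46])

# Single pass: append each distance to its label's bucket
dist_group = defaultdict(list)
for lb, ds in zip(labels, distances):
    dist_group[lb].append(ds)

# Convert each bucket to an ndarray
arrays = {lb: np.asarray(v) for lb, v in dist_group.items()}
```

This is O(N) in a single pass and, unlike the sort-based approaches, preserves the original order of distances within each group.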
CodePudding user response:
You can do this with numpy functions only. First sort the arrays in lockstep (which is what np.unique
does under the hood anyway), then split them where the label changes:
i = np.argsort(labels)
labels = labels[i]
distances = distances[i]
splitpoints = np.flatnonzero(np.diff(labels)) + 1
result = np.split(distances, splitpoints)
unique_labels = labels[np.r_[0, splitpoints]]
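The whole approach, runnable end to end on abbreviated sample data (values from the visible prefix of the question's arrays); fresh names are used instead of overwriting labels and distances:

```python
import numpy as np

labels = np.array([3, 1, 0, 1, 3, 2])
distances = np.array([2.32, 0.36, 0.96, 0.35, 2.79, 5.46])

# Sort both arrays in lockstep by label
i = np.argsort(labels)
labels_sorted = labels[i]
distances_sorted = distances[i]

# Split wherever the sorted label value changes
splitpoints = np.flatnonzero(np.diff(labels_sorted)) + 1
result = np.split(distances_sorted, splitpoints)
# The label at the start of each slice identifies its group
unique_labels = labels_sorted[np.r_[0, splitpoints]]
```

Compared with calling np.unique(return_counts=True), this avoids a second sort: np.diff on the already-sorted labels finds the group boundaries directly.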