Efficient way to pick first 'n' non-repeating elements in every row of a 2d numpy array

I have a 2d numpy array of integers and I want to pick the first 5 unique elements in every row.

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
   [  1,  36, 156, 152,  152,  37,  46, 143, 141, 114,  25, 134],
   [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

Notice the repeating elements in the first and second rows. The repeating elements appear next to each other. The output should be

array([[193, 64, 139, 180, 104], [1, 36, 156, 152, 37], [110, 96, 52, 53, 35]])

This is a sample array and the actual array has 20,000 rows. I'm looking for an efficient way to do this without the use of loops. Thanks in advance.

CodePudding user response：

Try with groupby:

>>> import numpy as np
>>> from itertools import groupby
>>> np.array([np.array([k for k, g in groupby(row)])[:5] for row in a])
array([[193,  64, 139, 180, 104],
       [  1,  36, 156, 152,  37],
       [110,  96,  52,  53,  35]])

CodePudding user response：

Update

To get rid of the explicit for loop (which I originally used so that a break statement could stop each row early), you can use itertools.takewhile() to play the role of the break inside the list comprehension. I timed two versions of the code, one with itertools.takewhile() and one without; the version with it was faster:

import numpy as np
from itertools import groupby, takewhile

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

result = [[k[0] for i, k in takewhile(lambda x: x[0] != 5, enumerate(groupby(row)))] for row in a]
print(np.array(result))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]
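
If you want to check the timing claim on your own machine, here is a minimal benchmarking sketch. The random 20,000-row array, the helper names (full_then_slice, with_takewhile, with_islice) and the repetition count are assumptions made up for illustration, not part of the answer above. Actual timings will depend on your hardware and data.

```python
import timeit
from itertools import groupby, islice, takewhile

import numpy as np

# Assumed test data: 20,000 rows of 12 small integers (not the asker's real array).
rng = np.random.default_rng(0)
big = rng.integers(0, 50, size=(20_000, 12))

def full_then_slice(arr, n=5):
    # Collapse every run in the row, then keep the first n keys.
    return [[k for k, _ in groupby(row)][:n] for row in arr]

def with_takewhile(arr, n=5):
    # Stop consuming groups once the running index reaches n.
    return [[k for i, (k, g) in takewhile(lambda x: x[0] != n, enumerate(groupby(row)))]
            for row in arr]

def with_islice(arr, n=5):
    # Same early stop, expressed with islice instead of enumerate + takewhile.
    return [[k for k, _ in islice(groupby(row), n)] for row in arr]

for fn in (full_then_slice, with_takewhile, with_islice):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{fn.__name__}: {seconds:.3f} s per call")
```

All three helpers should return the same lists, so the comparison is only about speed.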

(Using for loops)

You can try using the built-in enumerate() function along with itertools.groupby():

import numpy as np
from itertools import groupby

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

def get_unique(a, amt):
    for row in a:
        r = []
        for i, k in enumerate(groupby(row)):
            if i == amt:
                break
            r.append(k[0])
        yield r

for row in get_unique(a, 5):
    print(row)

Output:

[193, 64, 139, 180, 104]
[1, 36, 156, 152, 37]
[110, 96, 52, 53, 35]

Omitting the function:

import numpy as np
from itertools import groupby

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

result = []
for row in a:
    r = []
    for i, k in enumerate(groupby(row)):
        if i == 5:
            break
        r.append(k[0])
    result.append(r)

print(np.array(result))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

CodePudding user response：

Try numpy.apply_along_axis with itertools.groupby and itertools.islice:

import numpy as np
from itertools import groupby, islice

a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
              [1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
              [110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])


first_5_unique = lambda x: [k for k, _ in islice(groupby(x), 5)]
res = np.apply_along_axis(first_5_unique, axis=1, arr=a)
print(res)

Output

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

Or, a NumPy-only solution using numpy.argpartition and numpy.argsort:

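def first_k_unique(arr, k, axis=1):
    # Score each position: 0 where arr[i] == arr[i + 1], otherwise a negative
    # weight that is most negative for the earliest columns.
    weights = np.arange(start=arr.shape[1] - 2, stop=-1, step=-1)
    val = (np.diff(arr) != 0) * weights * -1
    # Positions of the k smallest scores (in arbitrary order), then their values.
    ind = np.argpartition(val, k, axis=axis)[:, :k]
    res = np.take_along_axis(arr, indices=ind, axis=axis)
    # Restore left-to-right order by sorting the selected scores.
    return np.take_along_axis(res, np.take_along_axis(val, indices=ind, axis=axis).argsort(axis), axis)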


print(first_k_unique(a, 5))

Output

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

The core explanation of the NumPy-only solution can be found here.
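
To make the scoring trick easier to follow, here is a step-by-step trace of the same idea on the first row of the sample array. This is only an illustrative sketch; the intermediate names (changed, weights, order, picked) are mine, and the weights 10 down to 0 match the 11 differences of a 12-column row.

```python
import numpy as np

row = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92]])
k = 5

# True where row[i] differs from row[i + 1], i.e. row[i] is the last element of a run.
changed = np.diff(row) != 0
# Earlier columns get larger weights: 10, 9, ..., 0.
weights = np.arange(10, -1, -1)
# After negation, earlier run ends are the most negative; repeated elements score 0.
val = changed * weights * -1
print(val.tolist())   # [[-10, 0, -8, -7, 0, -5, -4, -3, -2, -1, 0]]

# Positions of the k smallest scores; argpartition does not order them.
ind = np.argpartition(val, k, axis=1)[:, :k]
print(np.sort(ind, axis=1).tolist())   # [[0, 2, 3, 5, 6]]

# Sort the selected positions by score to restore left-to-right order.
order = np.take_along_axis(val, ind, axis=1).argsort(axis=1)
picked = np.take_along_axis(np.take_along_axis(row, ind, axis=1), order, axis=1)
print(picked.tolist())   # [[193, 64, 139, 180, 104]]
```

Note that the last column never gets a score of its own, because np.diff returns one element fewer than the row length.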

CodePudding user response：

Using NumPy alone, you can apply np.unique to each row with np.apply_along_axis. np.unique returns its values sorted, so the first-occurrence indices (return_index=True) are sorted to preserve the original order, and each row is padded with zeros to a common length so the results stack into a 2D array. Then just take the first 5 columns of the result (the := operator requires Python 3.8 or later):

np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=True)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]

>>> a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92], [  1,  36, 156, 152,  152,  37,  46, 143, 141, 114,  25, 134], [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])
>>> np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=True)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]
[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]
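
One thing worth knowing when choosing between the answers: np.unique drops every repeated value in a row, while itertools.groupby only collapses repeats that sit next to each other. On the sample array both give the same result because the repeats are adjacent. The short sketch below uses a made-up row (not from the question) where the two differ.

```python
from itertools import groupby

import numpy as np

# Hypothetical row in which 7 and 3 repeat in non-adjacent positions.
row = np.array([7, 7, 3, 7, 9, 9, 3])

# Consecutive de-duplication: a value is dropped only when it equals its left neighbour.
consecutive = [int(k) for k, _ in groupby(row)]

# Order-preserving global de-duplication: return_index gives the position of each
# value's first occurrence, and sorting those positions restores the original order.
_, first_pos = np.unique(row, return_index=True)
global_unique = row[np.sort(first_pos)].tolist()

print(consecutive)    # [7, 3, 7, 9, 3]
print(global_unique)  # [7, 3, 9]
```

If the real data can contain non-adjacent repeats, which behaviour counts as "non-repeating" decides which answer fits.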