Efficient way to pick first 'n' non-repeating elements in every row of a 2d numpy array

Time:10-06

I have a 2d numpy array of integers and I want to pick the first 5 unique elements in every row.

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
   [  1,  36, 156, 152,  152,  37,  46, 143, 141, 114,  25, 134],
   [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

Notice the repeating elements in the first and second rows. The repeating elements appear next to each other. The output should be

array([[193, 64, 139, 180, 104], [1, 36, 156, 152, 37], [110, 96, 52, 53, 35]])

This is a sample array and the actual array has 20,000 rows. I'm looking for an efficient way to do this without the use of loops. Thanks in advance.

CodePudding user response:

Try with groupby:

>>> import numpy as np
>>> from itertools import groupby
>>> np.array([np.array([k for k, g in groupby(row)])[:5] for row in a])
array([[193,  64, 139, 180, 104],
       [  1,  36, 156, 152,  37],
       [110,  96,  52,  53,  35]])
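For readers unfamiliar with itertools.groupby: called without a key function, it groups runs of consecutive equal values, so the keys are the row with adjacent repeats collapsed. A minimal sketch on the first row of the sample array (the names runs and collapsed are illustrative only):

```python
from itertools import groupby

row = [193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92]

# Each (key, group) pair covers one run of consecutive equal values.
runs = [(key, len(list(group))) for key, group in groupby(row)]

# Keeping only the keys collapses adjacent repeats.
collapsed = [key for key, _ in runs]

print(runs[:5])       # (value, run length) for the first five runs
print(collapsed[:5])  # first five values with adjacent repeats removed
```

Note that groupby only merges neighbours; a value that reappears later in the row, separated by other values, starts a new run.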

CodePudding user response:

Update

To avoid the explicit for loop (which I originally used so that a break statement could stop each row early), you can use the itertools.takewhile() function to play the role of break inside a list comprehension. I tested two versions of the code, one with itertools.takewhile() and one without; the version with takewhile() turned out faster:

import numpy as np
from itertools import groupby, takewhile

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

result = [[k[0] for i, k in takewhile(lambda x: x[0] != 5, enumerate(groupby(row)))] for row in a]
print(np.array(result))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

(Using for loops)

You can try using the built-in enumerate() function along with the itertools.groupby() function:

import numpy as np
from itertools import groupby

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

def get_unique(a, amt):
    for row in a:
        r = []
        for i, k in enumerate(groupby(row)):
            if i == amt:
                break
            r.append(k[0])
        yield r

for row in get_unique(a, 5):
    print(row)

Output:

[193, 64, 139, 180, 104]
[1, 36, 156, 152, 37]
[110, 96, 52, 53, 35]

Omitting the function:

import numpy as np
from itertools import groupby

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

result = []
for row in a:
    r = []
    for i, k in enumerate(groupby(row)):
        if i == 5:
            break
        r.append(k[0])
    result.append(r)

print(np.array(result))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

CodePudding user response:

Try numpy.apply_along_axis with itertools.groupby and itertools.islice:

import numpy as np
from itertools import groupby, islice

a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
              [1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
              [110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])


first_5_unique = lambda x: [k for k, _ in islice(groupby(x), 5)]
res = np.apply_along_axis(first_5_unique, axis=1, arr=a)
print(res)

Output

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

Or, a NumPy-only solution using numpy.argpartition and numpy.argsort:

def first_k_unique(arr, k, axis=1):
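    # Negative position weights: earlier change points get smaller values, so they are selected first.
    weights = np.arange(arr.shape[1] - 1, 0, -1)  # sized from arr rather than hard-coded for 12 columns
    val = (np.diff(arr) != 0) * weights * -1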
    ind = np.argpartition(val, k, axis=axis)[:, :k]
    res = np.take_along_axis(arr, indices=ind, axis=axis)
    return np.take_along_axis(res, np.take_along_axis(val, indices=ind, axis=axis).argsort(axis), axis)


print(first_k_unique(a, 5))

Output

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

The core explanation of the NumPy-only solution can be found here.
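As a possibly easier-to-read variation on the same idea, the sketch below marks the start of every run with a boolean mask and uses a stable argsort to move those positions to the front. This is not the method from the answer above, just an alternative under two assumptions: repeats are adjacent (as stated in the question) and every row contains at least k runs. The function name first_k_run_starts is hypothetical.

```python
import numpy as np

def first_k_run_starts(arr, k):
    # True where an element differs from its left neighbour, i.e. starts a new run.
    is_start = np.ones(arr.shape, dtype=bool)
    is_start[:, 1:] = arr[:, 1:] != arr[:, :-1]
    # A stable sort on ~is_start puts run starts first while keeping their original order.
    order = np.argsort(~is_start, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(arr, order, axis=1)

a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
              [1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
              [110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])

result = first_k_run_starts(a, 5)
print(result)
```

If a row has fewer than k runs, the remaining slots are filled with repeated elements from that row, so such rows would need separate handling.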

CodePudding user response:

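Using NumPy alone, you can apply np.unique to each row, but the unique values must then be put back in first-occurrence order (np.unique returns them sorted) and padded so that every row has the same length. Then just take the first 5 columns of the result (the := operator requires Python 3.8+):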

np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=1)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]

>>> a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92], [  1,  36, 156, 152,  152,  37,  46, 143, 141, 114,  25, 134], [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])
>>> np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=1)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]
[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]