I have a 2d numpy array of integers and I want to pick the first 5 unique elements in every row.
a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
[ 1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
[110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])
Notice the repeating elements in the first and second rows. The repeating elements appear next to each other. The output should be
array([[193, 64, 139, 180, 104], [1, 36, 156, 152, 37], [110, 96, 52, 53, 35]])
This is a sample array and the actual array has 20,000 rows. I'm looking for an efficient way to do this without the use of loops. Thanks in advance.
CodePudding user response:
Try itertools.groupby, which collapses each run of equal adjacent values into a single key:
>>> import numpy as np
>>> from itertools import groupby
>>> np.array([np.array([k for k, g in groupby(row)])[:5] for row in a])
array([[193, 64, 139, 180, 104],
[ 1, 36, 156, 152, 37],
[110, 96, 52, 53, 35]])
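To see why this works, here is a small sketch of what groupby yields for part of the first row. The variable names are only illustrative. Note that groupby merges adjacent repeats only, which matches the question because the repeated elements sit next to each other.

```python
from itertools import groupby

row = [193, 64, 64, 139, 180, 180, 104, 152]

# groupby yields one (key, group) pair per run of equal adjacent values.
runs = [(key, len(list(group))) for key, group in groupby(row)]
print(runs)  # [(193, 1), (64, 2), (139, 1), (180, 2), (104, 1), (152, 1)]

# Keeping only the keys and slicing gives the first 5 distinct runs.
first_five = [key for key, _ in groupby(row)][:5]
print(first_five)  # [193, 64, 139, 180, 104]

# Repeats that are not adjacent are NOT merged.
not_adjacent = [key for key, _ in groupby([1, 2, 1])]
print(not_adjacent)  # [1, 2, 1]
```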
CodePudding user response:
Update
To get rid of the explicit for loop (which I used because a break statement makes the program more efficient), you can use itertools.takewhile() to act as the break statement inside a list comprehension. It stops consuming a row as soon as 5 groups have been collected. I tested 2 versions of the code, one with itertools.takewhile() and one without; the former turned out faster:
import numpy as np
from itertools import groupby, takewhile
a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
[ 1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
[110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])
result = [[k[0] for i, k in takewhile(lambda x: x[0] != 5, enumerate(groupby(row)))] for row in a]
print(np.array(result))
Output:
[[193 64 139 180 104]
[ 1 36 156 152 37]
[110 96 52 53 35]]
(Using for loops)
You can try using the built-in enumerate() function along with itertools.groupby():
import numpy as np
from itertools import groupby
a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
[ 1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
[110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])
def get_unique(a, amt):
    for row in a:
        r = []
        for i, k in enumerate(groupby(row)):
            if i == amt:
                break
            r.append(k[0])
        yield r
for row in get_unique(a, 5):
    print(row)
Output:
[193, 64, 139, 180, 104]
[1, 36, 156, 152, 37]
[110, 96, 52, 53, 35]
Omitting the function:
import numpy as np
from itertools import groupby
a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
[ 1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
[110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])
result = []
for row in a:
    r = []
    for i, k in enumerate(groupby(row)):
        if i == 5:
            break
        r.append(k[0])
    result.append(r)
print(np.array(result))
Output:
[[193 64 139 180 104]
[ 1 36 156 152 37]
[110 96 52 53 35]]
CodePudding user response:
Try numpy.apply_along_axis with itertools.groupby and itertools.islice:
import numpy as np
from itertools import groupby, islice
a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
[1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
[110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])
first_5_unique = lambda x: [k for k, _ in islice(groupby(x), 5)]
res = np.apply_along_axis(first_5_unique, axis=1, arr=a)
print(res)
Output
[[193 64 139 180 104]
[ 1 36 156 152 37]
[110 96 52 53 35]]
Or, a numpy-only solution using numpy.argpartition and numpy.argsort:
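def first_k_unique(arr, k, axis=1):
    # Score each position where the value changes; earlier changes get more negative scores.
    # The weights are derived from the row length instead of being hard-coded for 12 columns.
    weights = np.arange(start=arr.shape[1] - 1, stop=0, step=-1)
    val = (np.diff(arr) != 0) * weights * -1
    # Indices of the k most negative scores, i.e. the first k change points (in no particular order).
    ind = np.argpartition(val, k, axis=axis)[:, :k]
    res = np.take_along_axis(arr, indices=ind, axis=axis)
    # Sort the picked values by score to restore the original left-to-right order.
    return np.take_along_axis(res, np.take_along_axis(val, indices=ind, axis=axis).argsort(axis), axis)
print(first_k_unique(a, 5))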
Output
[[193 64 139 180 104]
[ 1 36 156 152 37]
[110 96 52 53 35]]
The core explanation of the numpy-only solution can be found here.
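As a rough walkthrough of the same idea on a single row, the sketch below prints each intermediate step. It is an illustration, not the answer's exact code: the weights are derived from the row length (an assumption of this sketch), and the names changed, score and idx are made up for the example.

```python
import numpy as np

row = np.array([193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92])
k = 5

# True where an element differs from its right-hand neighbour,
# i.e. at the last element of every run.
changed = np.diff(row) != 0

# Earlier positions get larger weights, so after negation the
# earliest change points have the smallest (most negative) scores.
weights = np.arange(row.size - 1, 0, -1)
score = -(changed * weights)
print(score.tolist())  # [-11, 0, -9, -8, 0, -6, -5, -4, -3, -2, -1]

# argpartition finds the k smallest scores without a full sort;
# sorting the k indices restores left-to-right order.
idx = np.sort(np.argpartition(score, k)[:k])
print(idx.tolist())       # [0, 2, 3, 5, 6]
print(row[idx].tolist())  # [193, 64, 139, 180, 104]
```

Each picked index points at the last element of a run (index 2 for the pair of 64s, index 5 for the pair of 180s), which has the same value as the first element of that run.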
CodePudding user response:
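Using numpy alone, you can apply np.unique to each row with np.apply_along_axis. The unique values have to be put back into their original order (np.unique returns them sorted), and each row has to be padded to a common length because rows can contain different numbers of unique values. Then just take the first 5 columns of the result (the := operator requires Python 3.8 or later):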
np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=1)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]
>>> a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92], [ 1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134], [110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])
>>> np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=1)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]
[[193 64 139 180 104]
[ 1 36 156 152 37]
[110 96 52 53 35]]