I have a nested numpy array: it contains a lot of numpy sub-arrays, but the sub-arrays have different lengths. The main array main_arr looks something like this:
>>> main_arr
array([[array([3.5525, ..., 4.0138, 4.0139], dtype=float32)],
       [array([3.5525, ..., 4.0138, 4.0139], dtype=float32)],
       ...,
       [array([3.5525, ..., 4.0138, 4.0139], dtype=float32)]],
      dtype=object)
What I want to do is extract only the unique sub-arrays from the big main array, so I tried something like
np.unique(main_arr)
but this results in the error ValueError: operands could not be broadcast together with shapes (4613,) (4615,). I guess this is because some of the sub-arrays have different lengths.
How can I extract the unique sub-arrays from main_arr? A solution that doesn't rely on numpy would also be appreciated! Thanks
CodePudding user response:
You can use a dictionary grouping the arrays by length and then extract only the unique ones from each group.
import numpy as np

d = {}  # maps sub-array length -> list of sub-arrays of that length
for array in main_arr:
    n = array.size
    if n in d:
        d[n].append(array)
    else:
        d[n] = [array]

# Each length group can be stacked into a 2-D array, so np.unique
# with axis=0 finds the unique rows of that group
new_array = []
for k in d:
    new_array.extend(np.unique(np.stack(d[k]), axis=0))
However, extracting unique arrays this way is an expensive operation...
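If that cost matters, a cheaper alternative is to hash each sub-array instead of comparing or sorting arrays. This is only a minimal sketch (the function name unique_subarrays is made up here); it assumes the sub-arrays are ordinary 1-D numpy arrays, so that dtype, shape, and raw bytes together identify an array's contents:

import numpy as np

def unique_subarrays(arrays):
    # Key on (dtype, shape, raw bytes): two arrays with the same key are
    # element-wise equal, so each distinct array is kept once, in O(n).
    seen = set()
    unique = []
    for a in arrays:
        key = (a.dtype.str, a.shape, a.tobytes())
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique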
CodePudding user response:
The numpy unique function doesn't work on a ragged array of sub-arrays like this, but here's some logic you could deploy to get an array of unique sub-arrays:
import numpy as np

# Create an example array of sub-arrays (dtype=object is required for
# ragged sub-arrays on recent numpy versions)
a = np.array([
    np.array([1, 2, 3]), np.array([4, 5, 6, 7]),
    np.array([1, 2, 3]), np.array([4, 5, 6, 7])], dtype=object)

# Build a list of unique sub-arrays by pairwise comparison
unique = []
for sub_a in a:
    if not any(np.array_equal(i, sub_a) for i in unique):
        unique.append(sub_a)
unique_array = np.array(unique, dtype=object)
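For the example above this keeps one copy of each distinct sub-array:

>>> unique_array
array([array([1, 2, 3]), array([4, 5, 6, 7])], dtype=object)

Note that the pairwise np.array_equal check makes this quadratic in the number of sub-arrays, so it's best suited to modest inputs.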
CodePudding user response:
You might like to think about which groups of items of main_arr can be compared at all when looking for duplicates: only sub-arrays of the same length can be equal. So you need to group main_arr by the lengths of the arrays it contains; after that you can call np.unique on each group.
import numpy as np

main_arr = np.array([np.array([3.5525, 3.7895, 4.0139], dtype=float),
                     np.array([3.5525, 3.7895, 4.0139], dtype=float),
                     np.array([3.5525, 4.0138, 4.0139, 4.1], dtype=float),
                     np.array([3.5525, 4.0138, 4.0139], dtype=float),
                     np.array([3.5525, 4.0138, 4.0139, 4.1], dtype=float),
                     np.array([3.5525, 3.7895, 4.0138, 4.0139, -1], dtype=float)], dtype=object)
from itertools import groupby
groups = [list(g) for k, g in groupby(sorted(main_arr, key=len), len)]
# a generator expression (...) instead of the list [...] would avoid iterating twice
>>> groups
[[array([3.5525, 3.7895, 4.0139]),
  array([3.5525, 3.7895, 4.0139]),
  array([3.5525, 4.0138, 4.0139])],
 [array([3.5525, 4.0138, 4.0139, 4.1   ]),
  array([3.5525, 4.0138, 4.0139, 4.1   ])],
 [array([ 3.5525,  3.7895,  4.0138,  4.0139, -1.    ])]]
>>> [np.unique(g, axis=0) for g in groups]
[array([[3.5525, 3.7895, 4.0139],
        [3.5525, 4.0138, 4.0139]]),
 array([[3.5525, 4.0138, 4.0139, 4.1   ]]),
 array([[ 3.5525,  3.7895,  4.0138,  4.0139, -1.    ]])]
You could concatenate all these per-group results, but then you'd have a ragged data structure that numpy processing is not designed for.
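If you do still want a single flat container, one possible sketch (the names uniques, rows, and all_unique are made up here): pre-allocate an object array and fill it element by element, since np.array(rows) would refuse to build a ragged array on recent numpy versions.

uniques = [np.unique(g, axis=0) for g in groups]
rows = [row for u in uniques for row in u]  # each row is a 1-D unique sub-array
all_unique = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    all_unique[i] = row  # fill one element at a time to keep rows ragged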
Note: I've changed the initial data a little bit.