I have a nested numpy array: it contains a lot of numpy sub-arrays, but the sub-arrays have different lengths. The main array main_arr looks something like this:
>>> main_arr
array([[array([3.5525, ..., 4.0138, 4.0139], dtype=float32)],
       [array([3.5525, ..., 4.0138, 4.0139], dtype=float32)],
       ...,
       [array([3.5525, ..., 4.0138, 4.0139], dtype=float32)]],
      dtype=object)
What I want to do is extract only the unique sub-arrays from the big main array, so I tried something like
np.unique(main_arr)
but this results in the error ValueError: operands could not be broadcast together with shapes (4613,) (4615,). I guess this is because some of the sub-arrays have different lengths.
How can I extract the unique sub-arrays from main_arr? A solution that doesn't rely on numpy would also be appreciated! Thanks
CodePudding user response:
You can use a dictionary grouping the arrays by length and then extract only the unique ones from each group.
import numpy as np

d = {}  # maps sub-array length -> list of sub-arrays of that length
for array in main_arr:
    n = array.size
    if n in d:
        d[n].append(array)
    else:
        d[n] = [array]

# Each length group can be stacked into a 2-D array, so np.unique
# with axis=0 finds the unique rows of that group
new_array = []
for k in d:
    new_array.extend(np.unique(np.stack(d[k]), axis=0))
However, extracting unique arrays this way is an expensive operation...
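If that cost matters, a cheaper alternative is to hash each sub-array instead of comparing or sorting arrays. This is only a minimal sketch (the function name unique_subarrays is made up here); it assumes the sub-arrays are ordinary 1-D numpy arrays, so that dtype, shape, and raw bytes together identify an array's contents:

import numpy as np

def unique_subarrays(arrays):
    # Key on (dtype, shape, raw bytes): two arrays with the same key are
    # element-wise equal, so each distinct array is kept once, in O(n).
    seen = set()
    unique = []
    for a in arrays:
        key = (a.dtype.str, a.shape, a.tobytes())
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique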
CodePudding user response:
The numpy unique function doesn't work on a ragged array of sub-arrays like this, but here's some logic you could deploy to get an array of unique sub-arrays:
import numpy as np

# Create an example array of sub-arrays (dtype=object is required for
# ragged sub-arrays on recent numpy versions)
a = np.array([
    np.array([1, 2, 3]), np.array([4, 5, 6, 7]),
    np.array([1, 2, 3]), np.array([4, 5, 6, 7])], dtype=object)

# Build a list of unique sub-arrays by pairwise comparison
unique = []
for sub_a in a:
    if not any(np.array_equal(i, sub_a) for i in unique):
        unique.append(sub_a)
unique_array = np.array(unique, dtype=object)
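For the example above this keeps one copy of each distinct sub-array:

>>> unique_array
array([array([1, 2, 3]), array([4, 5, 6, 7])], dtype=object)

Note that the pairwise np.array_equal check makes this quadratic in the number of sub-arrays, so it's best suited to modest inputs.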
CodePudding user response:
You might like to think about which groups of items of main_arr can be compared at all when looking for duplicates: only sub-arrays of the same length can be equal. So you need to group main_arr by the lengths of the arrays it contains; after that you can call np.unique on each group.
import numpy as np

main_arr = np.array([np.array([3.5525, 3.7895, 4.0139], dtype=float),
                     np.array([3.5525, 3.7895, 4.0139], dtype=float),
                     np.array([3.5525, 4.0138, 4.0139, 4.1], dtype=float),
                     np.array([3.5525, 4.0138, 4.0139], dtype=float),
                     np.array([3.5525, 4.0138, 4.0139, 4.1], dtype=float),
                     np.array([3.5525, 3.7895, 4.0138, 4.0139, -1], dtype=float)], dtype=object)
from itertools import groupby
groups = [list(g) for k, g in groupby(sorted(main_arr, key=len), len)]
# a generator expression (...) instead of the list [...] would avoid iterating twice
>>> groups
[[array([3.5525, 3.7895, 4.0139]),
  array([3.5525, 3.7895, 4.0139]),
  array([3.5525, 4.0138, 4.0139])],
 [array([3.5525, 4.0138, 4.0139, 4.1   ]),
  array([3.5525, 4.0138, 4.0139, 4.1   ])],
 [array([ 3.5525,  3.7895,  4.0138,  4.0139, -1.    ])]]
>>> [np.unique(g, axis=0) for g in groups]
[array([[3.5525, 3.7895, 4.0139],
        [3.5525, 4.0138, 4.0139]]),
 array([[3.5525, 4.0138, 4.0139, 4.1   ]]),
 array([[ 3.5525,  3.7895,  4.0138,  4.0139, -1.    ]])]
You could concatenate all these per-group results, but then you'd have a ragged data structure that numpy processing is not designed for.
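If you do still want a single flat container, one possible sketch (the names uniques, rows, and all_unique are made up here): pre-allocate an object array and fill it element by element, since np.array(rows) would refuse to build a ragged array on recent numpy versions.

uniques = [np.unique(g, axis=0) for g in groups]
rows = [row for u in uniques for row in u]  # each row is a 1-D unique sub-array
all_unique = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    all_unique[i] = row  # fill one element at a time to keep rows ragged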
Note: I've changed the initial data a little bit.