I have a list (which I call chunks) with len(chunks) = 195 and len(chunks[0]) = 32. The elements inside chunks[0] are of type numpy.ndarray with shape (9, 103).
>>> type(chunks[0][0])
<class 'numpy.ndarray'>
>>> type(chunks[0][0][0])
<class 'numpy.ndarray'>
>>> type(chunks[0][0][0][0])
<class 'numpy.float64'>
I'm trying to find out whether there are duplicates in chunks[0]. The most appropriate way I could think of was len(chunks[0]) != len(set(chunks[0])), but that throws an error: TypeError: unhashable type. Is there another workable way to investigate whether the elements inside chunks[0] are equal and, if so, to eliminate the duplicates from the list? Would transforming them to tensors be an advisable way to check for duplicates quickly?
CodePudding user response:
The problem
Hashable data types, i.e., those that can be used as elements of sets or keys of dicts, have to be immutable. That's because a value has to produce the same hash every time you look it up; if it could be modified, its hash would change. For example, lists and NumPy arrays can be changed and are therefore not hashable, but tuples are immutable, so they are hashable.
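You can check this for yourself; this is standard Python/NumPy behavior, nothing specific to your data:
import numpy as np

hash((1.0, 2.0))            # works: tuples are immutable, hence hashable
hash([1.0, 2.0])            # raises TypeError: unhashable type: 'list'
hash(np.zeros((9, 103)))    # raises TypeError: unhashable type: 'numpy.ndarray'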
One possible solution
You can create a tuple containing the values from your list or array or list of arrays, and use that in your set.
Sample code
You could use functions like these to solve your problem:
def array_2d_to_tuples(a):
    return tuple(tuple(row) for row in a)

def list_of_2d_arrays_to_tuples(a_list):
    return tuple(array_2d_to_tuples(a) for a in a_list)
These two functions return "2D" and "3D" tuples, which are hashable. You can insert their return values into sets.
And then this could work to detect whether any two chunks contain the same 32 arrays in the same order:
len(chunks) != len(set(list_of_2d_arrays_to_tuples(chunk) for chunk in chunks))
Or, if you want to look for duplicate arrays within chunks[0]:
len(chunks[0]) != len(set(array_2d_to_tuples(a) for a in chunks[0]))
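To see it in action, here is a small self-contained example with made-up data (not your actual chunks), where the third array duplicates the first by value:
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((9, 103))
b = rng.random((9, 103))
fake_chunk = [a, b, a.copy()]   # a.copy() equals a element-wise

# 3 arrays but only 2 distinct value patterns, so this prints True
print(len(fake_chunk) != len(set(array_2d_to_tuples(x) for x in fake_chunk)))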
Eliminating the duplicates
If you want to eliminate the duplicates from the list, I would unroll that code a bit. Let chunk = chunks[0] and say you want uniq_chunk to hold the arrays from chunk without the duplicates. This code should do the trick:
found = set()      # tuple versions of the arrays seen so far
uniq_chunk = []    # arrays from chunk, in order, minus duplicates
for a in chunk:
    as_tuple = array_2d_to_tuples(a)
    if as_tuple not in found:  # first occurrence of these values
        found.add(as_tuple)
        uniq_chunk.append(a)
You can adjust this approach to the exact thing you're trying to deduplicate.
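As for speed: converting to tensors shouldn't be necessary. If the nested-tuple conversion turns out to be slow on your data, a common faster alternative (a sketch, assuming all arrays share the same shape and dtype, as yours do) is to use the raw bytes of each array as the set key instead:
found = set()
uniq_chunk = []
for a in chunk:
    key = a.tobytes()   # the array's raw buffer: hashable and cheap to build
    if key not in found:
        found.add(key)
        uniq_chunk.append(a)
The same-shape/same-dtype assumption matters because two arrays with identical bytes but different shapes or dtypes would wrongly be treated as duplicates.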