I have a workflow where I process lists, and keep track of which ones I've seen in a dictionary - because when I get to one that's been seen already, I can pass on it. The caveat is that two lists with the same elements ordered differently are considered duplicates. So to take a simple example of integer lists: [3, 1, 5, 6]
and [5, 3, 6, 1]
are equivalent.
How I've been doing this is, take tuple(sorted(L))
and put that in my dictionary. So for the above example seen
looks like:
seen = {..., (1, 3, 5, 6): 1, ...}
This has the nice advantage of being constant time lookup for each list I potentially need to process. The problem however is that in order to check if a given list is in seen
, I have to do tuple(sorted(L))
on it. And it turns out that for large amounts of data, this becomes prohibitive to the point of taking over 50% of the total time of my entire process.
I'd like to make use of collections.Counter
somehow - because Counter[3, 1, 5, 6]
and Counter[5, 3, 6, 1]
would evaluate equal. But Counter objects can't be used as dict keys. How can I keep my dictionary lookup, but without doing the sorting operation indicated above? Thanks in advance.
EDIT: In looking at the suggestions to use frozenset
I realized I left an important thing out, which is that elements can duplicate. So while the two lists above compare equal, [3, 1, 5, 6]
and [3, 3, 1, 5, 6, 6]
need to be seen as distinct. Since the frozenset
operation would remove duplicates, it isn't an option here.
CodePudding user response:
As others have pointed out, frozenset is your best option:
seen = set()
list_1, list_2, list_3 = [3, 1, 5, 6], [5, 3, 6, 1], [1, 2, 3]
print(frozenset(list_1) in seen)
seen.add(frozenset(list_1))
print(frozenset(list_1) in seen)
print(frozenset(list_2) in seen)
print(frozenset(list_3) in seen)
frozenset
takes as an argument an iterable so instantiating the set would take O(n). Lookups have on average O(1), same as standard set
, according to this previous response. Either way, you would save the very expensive operation of sort which is O(n log n).
CodePudding user response:
If you do have a lot of duplicates, then maybe use frozenset
of Counter
items?
>>> from collections import Counter
>>> frozenset(Counter([3, 3, 1, 5, 6, 6]).items())
frozenset({(1, 1), (3, 2), (6, 2), (5, 1)})
Or sorted tuples of Counter
items:
>>> tuple(sorted(Counter([3, 3, 1, 5, 6, 6]).items()))
((1, 1), (3, 2), (5, 1), (6, 2))