Home > Back-end >  Keeping track of processed lists
Keeping track of processed lists

Time:10-17

I have a workflow where I process lists, and keep track of which ones I've seen in a dictionary - because when I get to one that's been seen already, I can pass on it. The caveat is that two lists with the same elements ordered differently are considered duplicates. So to take a simple example of integer lists: [3, 1, 5, 6] and [5, 3, 6, 1] are equivalent.

How I've been doing this is, take tuple(sorted(L)) and put that in my dictionary. So for the above example seen looks like:

seen = {..., (1, 3, 5, 6): 1, ...}

This has the nice advantage of being constant time lookup for each list I potentially need to process. The problem however is that in order to check if a given list is in seen, I have to do tuple(sorted(L)) on it. And it turns out that for large amounts of data, this becomes prohibitive to the point of taking over 50% of the total time of my entire process.

I'd like to make use of collections.Counter somehow - because Counter[3, 1, 5, 6] and Counter[5, 3, 6, 1] would evaluate equal. But Counter objects can't be used as dict keys. How can I keep my dictionary lookup, but without doing the sorting operation indicated above? Thanks in advance.

EDIT: In looking at the suggestions to use frozenset I realized I left an important thing out, which is that elements can duplicate. So while the two lists above compare equal, [3, 1, 5, 6] and [3, 3, 1, 5, 6, 6] need to be seen as distinct. Since the frozenset operation would remove duplicates, it isn't an option here.

CodePudding user response:

As others have pointed out, frozenset is your best option:

seen = set()

list_1, list_2, list_3 = [3, 1, 5, 6], [5, 3, 6, 1], [1, 2, 3]

print(frozenset(list_1) in seen)
seen.add(frozenset(list_1))
print(frozenset(list_1) in seen)
print(frozenset(list_2) in seen)
print(frozenset(list_3) in seen)

frozenset takes as an argument an iterable so instantiating the set would take O(n). Lookups have on average O(1), same as standard set, according to this previous response. Either way, you would save the very expensive operation of sort which is O(n log n).

CodePudding user response:

If you do have a lot of duplicates, then maybe use frozenset of Counter items?

>>> from collections import Counter
>>> frozenset(Counter([3, 3, 1, 5, 6, 6]).items())
frozenset({(1, 1), (3, 2), (6, 2), (5, 1)})

Or sorted tuples of Counter items:

>>> tuple(sorted(Counter([3, 3, 1, 5, 6, 6]).items()))
((1, 1), (3, 2), (5, 1), (6, 2)) 
  • Related