I can't quite put my problem into a single sentence, so the title may not be clear (if you can think of a better one, please suggest it). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline   | Free Baggage |
|--------|-------------|-----------|--------------|
| NYC    | London      | American  | 20KG         |
| NYC    | *           | Southwest | 30KG         |
| *      | *           | Southwest | 25KG         |
| *      | LA          | *         | 20KG         |
| *      | *           | *         | 15KG         |

and so on ...
This table describes the free baggage amount that airlines provide on different routes. Some rows contain a `*` value, meaning they match every possible value (and those values are not necessarily known in advance).
So we have a large list of baggage rules (like the table above) and a large list of flights (whose origin, destination and airline are known), and we want to find the baggage amount for each flight as efficiently as possible (iterating over the list is obviously not efficient, as each lookup would cost O(N)). More than one rule may match a given flight; in that case, assume that either the first match or the most specific one is preferred (whichever is simpler for you to work with).
If there were no `*` signs in the table, the problem would be easy: we could use a HashMap or Dictionary with a tuple of values as the key. But with those `*` (let's say match-all) keys present, it is not so straightforward to provide a general solution.
Please note that the above is just an example; I need a solution that works for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup method whose time complexity is O(1) or close to it, like a regular hashmap (memory will not be an issue)? What would be the best possible solution?
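To make the difficulty concrete: the tuple-keyed dictionary mentioned above handles the exact rows fine, but a `*` row is invisible to it (a small illustrative sketch, not part of the question):

```python
# A plain dict keyed by the full (origin, destination, airline) tuple.
rules = {
    ("NYC", "London", "American"): "20KG",
    ("NYC", "*", "Southwest"): "30KG",
}

# Exact rows are found in O(1)...
print(rules.get(("NYC", "London", "American")))  # 20KG
# ...but a flight that should match the ("NYC", "*", "Southwest") rule
# misses, because the dict only compares keys for exact equality.
print(rules.get(("NYC", "Dallas", "Southwest")))  # None
```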
CodePudding user response:
Regarding the comments, the more I think about it, the more it looks like a relational database with indexes rather than a hashmap...
A trivial, quite easy solution could be an in-memory SQLite database. It would probably be O(log2(n)) rather than O(1), but its main advantage is that it's easy to set up, and IF performance turns out to be good enough, it could be the final solution. The key here is to use proper indexes, the `LIKE` operator, and of course well-defined `JOIN` clauses.
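For illustration, here is a minimal sketch of that idea with Python's built-in `sqlite3` module. It is my own sketch, not part of the answer: it matches each column against the exact value or the `'*'` joker with plain equality instead of `LIKE`, and ranks candidates by counting jokers so the most specific rule wins.

```python
import sqlite3

# In-memory database holding the baggage rules. Schema, index and the
# joker-counting ORDER BY are assumptions for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rules (origin TEXT, dest TEXT, airline TEXT, baggage TEXT)")
con.execute("CREATE INDEX idx_rules ON rules (origin, dest, airline)")
con.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("NYC", "London", "American", "20KG"),
    ("NYC", "*", "Southwest", "30KG"),
    ("*", "*", "Southwest", "25KG"),
    ("*", "LA", "*", "20KG"),
    ("*", "*", "*", "15KG"),
])

def baggage_for(origin, dest, airline):
    # Each column matches either the exact value or the '*' joker;
    # ORDER BY the number of jokers so the most specific rule comes first.
    row = con.execute(
        """SELECT baggage FROM rules
           WHERE (origin = :o OR origin = '*')
             AND (dest = :d OR dest = '*')
             AND (airline = :a OR airline = '*')
           ORDER BY (origin = '*') + (dest = '*') + (airline = '*')
           LIMIT 1""",
        {"o": origin, "d": dest, "a": airline}).fetchone()
    return row[0] if row else None

print(baggage_for("NYC", "London", "American"))     # 20KG
print(baggage_for("Dallas", "Houston", "Southwest"))  # 25KG
```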
From scratch, I can't think of any solution that, with N rows and M columns, isn't at least O(M)... But you'll usually have far fewer columns than rows. Quickly - I may have skipped a detail, I'm writing this on the fly - I can propose this algorithm / container:

- Data must be stored in a vector-like container `VECDATA`, accessed by a simple index in O(1). Think of this as a primary key in a database; we'll call it `PK`. Knowing `PK` gives you the required data instantly, in O(1). You'll have N rows grand total.
- For each row NOT containing any `*`, insert the pair `(<tuple>, PK)` into a real hashmap called `MAINHASH`. This is your primary index, for exact results. It is O(1), BUT what you're looking for may not be in it... Obviously, you must maintain consistency between `MAINHASH` and `VECDATA` with whatever is needed (mutexes, locks - doesn't matter, as long as both stay consistent). This hash contains at most N entries. Without any joker rows, it behaves almost like a standard hashmap, apart from the indirection to `VECDATA`; it's still O(1) in this case.
- For each searchable column, build a specific index dedicated to that column. The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key - that's quite a common container, so it shouldn't be an issue. For each row, the index entry will be `(<VECDATA value>, PK)`. The containers are stored in a vector of indexes, `INDEX[i]` (with `0 <= i < M`). As with `MAINHASH`, consistency must be enforced.

Obviously, all these indexes / subcontainers should be built when an entry is inserted into `VECDATA`, and saved to disk across sessions if needed - you don't want to rebuild all of this each time the application starts...
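As a sketch of the construction above (the container names `VECDATA`, `MAINHASH` and `INDEX` follow the answer; the rest is my reconstruction, not the author's code):

```python
from collections import defaultdict

# Hypothetical rule set: (origin, destination, airline, baggage).
ROWS = [
    ("NYC", "London", "American", "20KG"),
    ("NYC", "*", "Southwest", "30KG"),
    ("*", "*", "Southwest", "25KG"),
    ("*", "LA", "*", "20KG"),
    ("*", "*", "*", "15KG"),
]
M = 3  # number of searchable columns

VECDATA = []   # PK -> full row, O(1) access
MAINHASH = {}  # exact tuple -> PK, for rows without '*'
INDEX = [defaultdict(list) for _ in range(M)]  # per-column value -> [PK]

for row in ROWS:
    pk = len(VECDATA)
    VECDATA.append(row)
    key = row[:M]
    if "*" not in key:
        MAINHASH[key] = pk          # primary index, exact rows only
    for i, value in enumerate(key):
        INDEX[i][value].append(pk)  # multimap: one key, several PKs

print(MAINHASH[("NYC", "London", "American")])  # 0 -> VECDATA[0]
print(INDEX[2]["Southwest"])                    # [1, 2]
```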
Searching a row
So, the user searches for a given tuple.

1. Search for it in `MAINHASH`. If found, return it; the search is done. Upgrade (see below): also search in `CACHE` before going to step 2.
2. For each tuple element `tuple[i]` (with `0 <= i < M`), search `INDEX[i]` both for `tuple[i]` (returning a vector of `PK`s, `EXACT[i]`) AND for `*` (returning another vector of `PK`s, `FUZZY[i]`).
3. With these two vectors, build a temporary hash `TMPHASH` associating `(PK, integer COUNT)`. It's quite simple: `COUNT` is initialized to 1 if the entry comes from `EXACT`, and 0 if it comes from `FUZZY`.
4. For the next column, build `EXACT` and `FUZZY` again (see step 2). But instead of creating a new `TMPHASH`, MERGE the results into the existing one. The method: if `TMPHASH` doesn't contain this `PK`, discard the entry - it can't match at all. Otherwise, read its `COUNT` value, add 1 or 0 to it according to where the entry comes from, and put it back into `TMPHASH`.
5. Once all columns are done, analyze `TMPHASH`.
Analyzing TMPHASH
First, if `TMPHASH` is empty, you don't have any suitable answer; return that to the user. If it contains only one entry, likewise: return it to the user directly.

For more than one element in `TMPHASH`:

- Parse the whole `TMPHASH` container, searching for the maximum `COUNT`. Keep in memory the `PK` associated with the current maximum.
- Developer's choice: if several entries share the maximum `COUNT`, you can return all of them, the first one, or the last one. `COUNT` is obviously always strictly lower than M - otherwise you would have found the tuple in `MAINHASH`. Compared to M, this value can serve as a confidence mark for the result (`100*COUNT/M` % confidence).
- You can also now store the original tuple searched and the corresponding `PK` in another hashmap called `CACHE`. Since it would be far too complicated to update `CACHE` properly when something is added to or modified in `VECDATA`, simply purge `CACHE` when that happens. It's only a cache, after all...
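Putting the search steps and the `TMPHASH` analysis together, a self-contained Python sketch could look like this. Again, the container names follow the answer but the code is my reconstruction: the `CACHE` upgrade is omitted, a single best row is returned, and ties on `COUNT` are broken arbitrarily.

```python
from collections import defaultdict

# Rule rows: (origin, destination, airline, baggage).
ROWS = [
    ("NYC", "London", "American", "20KG"),
    ("NYC", "*", "Southwest", "30KG"),
    ("*", "*", "Southwest", "25KG"),
    ("*", "LA", "*", "20KG"),
    ("*", "*", "*", "15KG"),
]
M = 3  # number of searchable columns

VECDATA, MAINHASH = [], {}
INDEX = [defaultdict(list) for _ in range(M)]
for row in ROWS:
    pk = len(VECDATA)
    VECDATA.append(row)
    key = row[:M]
    if "*" not in key:
        MAINHASH[key] = pk          # exact rows only
    for i, value in enumerate(key):
        INDEX[i][value].append(pk)  # per-column multimap

def lookup(query):
    # Step 1: exact hit in MAINHASH.
    if query in MAINHASH:
        return VECDATA[MAINHASH[query]]
    # Steps 2-4: per column, fetch EXACT and FUZZY PKs, then merge.
    TMPHASH = None
    for i, value in enumerate(query):
        counts = {pk: 1 for pk in INDEX[i].get(value, [])}  # EXACT -> 1
        for pk in INDEX[i].get("*", []):
            counts.setdefault(pk, 0)                        # FUZZY -> 0
        if TMPHASH is None:
            TMPHASH = counts
        else:
            # Keep only PKs present in both; accumulate COUNT.
            TMPHASH = {pk: TMPHASH[pk] + c
                       for pk, c in counts.items() if pk in TMPHASH}
    # Step 5: analyze TMPHASH - highest COUNT = most specific rule.
    if not TMPHASH:
        return None
    best = max(TMPHASH, key=TMPHASH.get)
    return VECDATA[best]

print(lookup(("NYC", "Dallas", "Southwest")))    # ('NYC', '*', 'Southwest', '30KG')
print(lookup(("Denver", "Houston", "American"))) # ('*', '*', '*', '15KG')
```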
This is quite complex to implement if the language doesn't help you, in particular by letting you redefine operators and providing all the base containers, but it should work.
Exact matches / cached matches are O(1). The fuzzy search is O(n.M), n being the number of matching rows (with 0 <= n < N, of course).
Without further research, I can't see anything better than this. It will consume an obscene amount of memory, but you said that won't be an issue.
CodePudding user response:
I would recommend doing this with tries decorated with a little extra data. For routes, you want to know the lowest route ID, so we can match against the first available route. For flights, you want to track how many flights are left to match.
What this allows you to do, for instance, is realize ONLY ONCE, partway through the match, that flights from city1 to city2 might match routes starting with `city1, city2`, or `city1, *`, or `*, city2`, or `*, *`, without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref

class Flight:
    def __init__(self, fields, flight_no):
        self.fields = fields
        self.flight_no = flight_no

class Route:
    def __init__(self, route_id, fields, baggage):
        self.route_id = route_id
        self.fields = fields
        self.baggage = baggage

class SearchTrie:
    def __init__(self, value=0, item=None, parent=None):
        # value = number of unmatched flights for flight tries
        # value = lowest route id for route tries
        self.value = value
        self.item = item
        self.trie = {}
        self.parent = None
        if parent:
            self.parent = weakref.ref(parent)

    def add_flight(self, flight, i=0):
        self.value += 1
        fields = flight.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(0, None, self)
            self.trie[fields[i]].add_flight(flight, i + 1)
        else:
            self.item = flight

    def remove_flight(self):
        self.value -= 1
        if self.parent and self.parent():
            self.parent().remove_flight()

    def add_route(self, route, i=0):
        route_id = route.route_id
        fields = route.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(route_id)
            self.trie[fields[i]].add_route(route, i + 1)
        else:
            self.item = route

def match_flight_baggage(route_search, flight_search):
    # Construct a heap of one search to do.
    tmp_id = 0
    todo = [((0, tmp_id), route_search, flight_search)]
    # This will hold baggage by flight number.
    matched = {}
    while 0 < len(todo):
        priority, route_search, flight_search = heapq.heappop(todo)
        if 0 == flight_search.value:
            # There are no flights left to match under this node.
            pass
        elif flight_search.item is not None:
            # We found a match!
            matched[flight_search.item.flight_no] = route_search.item.baggage
            flight_search.remove_flight()
        else:
            for key, r_search in route_search.trie.items():
                if key == '*':  # Found wildcard.
                    for a_search in flight_search.trie.values():
                        if 0 < a_search.value:
                            heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
                            tmp_id += 1
                elif key in flight_search.trie and 0 < flight_search.trie[key].value:
                    heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
                    tmp_id += 1
    return matched

# Sample data - the id is the position.
route_data = [
    ["NYC", "London", "American", "20KG"],
    ["NYC", "*", "Southwest", "30KG"],
    ["*", "*", "Southwest", "25KG"],
    ["*", "LA", "*", "20KG"],
    ["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
    data = route_data[i]
    routes.append(Route(i, [data[0], data[1], data[2]], data[3]))

flight_data = [
    ["NYC", "London", "American"],
    ["NYC", "Dallas", "Southwest"],
    ["Dallas", "Houston", "Southwest"],
    ["Denver", "LA", "American"],
    ["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
    data = flight_data[i]
    flights.append(Flight([data[0], data[1], data[2]], i))

# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
    flight_search.add_flight(flight)
route_search = SearchTrie()
for route in routes:
    route_search.add_route(route)

print(match_flight_baggage(route_search, flight_search))