I built this inverted index:
{
'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...},
'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}
}
For example, the term experiment
appears in document 1 one time at index zero, in document 30 two times at indices 12 and 40, etc. I am wondering how I could count the number of occurrences of each term in the dictionary based on a dictionary of queries that looks like this:
{
'q1' : ['similar', 'law', ..., 'speed', 'aircraft'],
'q2' : ['structur', 'aeroelast', ..., 'speed', 'aircraft'],
...
'q225': ['design', 'factor', ..., 'number', '5']
}
The desired output would look something like this:
{
'q1' : ['d51', 'd874', ..., 'd717'],
'q2' : ['d51', 'd1147', ..., 'd14'],
...,
'q225': ['d1313', 'd996', ..., 'd193']
}
With keys representing the query and values representing the documents that the query appeared in, and the list would be sorted in descending order of total term frequencies
CodePudding user response:
Map queries to document vectors
A document vector is a dict with items (document, word_count)
. These vectors can be added together by summing the word count for matching document keys with a default word_count of 0.
CONVERT INDEX TO DOC VECTORS
full_index = {
'experiment': {'d1': [1, [0]], 'd30': [2, [12, 40]], 'd123': [3, [11, 45, 67]] } ,
'study': {'d1': [1, [1]], 'd2': [2, [0, 36]], 'd207': [3, [19, 44, 59]]}
}
def count_only(docs):
return {d: occurences[0] for d, occurences in docs.items()}
doc_vector_index = {w: count_only(docs) for w, docs in full_index.items()}
MAP LIST OF QUERY WORDS TO LIST OF DOC VECTORS
for q, words in queries.items():
vectors = [doc_vector_index[word] for word in words if word in doc_vector_index.keys()]
SUM DOC VECTORS AND SORT
def doc_vector_add(ldoc, rdoc):
res = ldoc.copy()
for doc, count in rdoc.items():
res[doc] = ldoc.get(doc,0) count
return res
for q, words in queries.items():
vectors = [doc_vector_index[word] for word in words if word in doc_vector_index.keys()]
total_vector = dict(sorted(functools.reduce(doc_vector_add, vectors, {}).items(),
key=lambda item: item[1],
reverse=True))
output[q] = list(total_vector.keys())
The summation of doc vectors is handled using reduce functools.reduce(doc_vector_add, vectors, {})
. This produces the doc vector that is the sum of the individual vectors for each word in the query. sorted
is used to sort the keys of the vector.
LIMIT TO TOP N DOCUMENTS
max_doc_limit = 10
output[q] = list(total_vector.keys())[:max_doc_limit]
Limiting the documents can be handled by slicing before assigning to the output.
ORDER BY COUNT DESC, DOC_ID ASC
sorted(...,key=lambda item: (item[1], -1*int(item[0][1:]),...)
We can change the sorting order of the output by changing the key function passed to sorted
. We use a trick of multiplying the second element in the tuple by -1 to reverse the order from descending to ascending.