Search engines use a method called inverse (or reverse) indexing. In simplest terms, the reverse indexing strategy inverts keys and values: keys become values and values become keys. Instead of searching through all existing websites to find keywords, the search engine has a pre-made mapping from keywords to websites, so that for any given keyword, it can immediately list all websites where that keyword appears.
Your task is to write an algorithm that takes a list of strings as input and returns a dictionary, containing an index of the words to the matching strings.
In the dictionary, each key will be a word k, while the value will be a list of indices of the input strings where the word k appears.
Words should be treated as lowercase only. i.e. Hello and hello should be treated the same.
You can assume that the dataset will contain only lists of strings. You don't need to check the type of the elements in the dataset.
The string data in the dataset will be clean. This means you don't need to worry about cleaning i.e. removing punctation marks or numbers.
The reverse_index function is supposed to create and return the dictionary.
dataset = [
"Hello world",
"This is the WORLD",
"hello again"
]
res = reverse_index(dataset)
# This assertion checks if the result equals the expected dictinary
assert(res == {
'hello': [0, 2],
'world': [0, 1],
'this': [1],
'is': [1],
'the': [1],
'again':[2]
}
dataset = [
"Hello world",
"This is the WORLD",
"hello again"
]
def reverse_index(dataset):
index_dictionary = {}
print(reverse_index(dataset))
So my question is do i have to split the words in the list dataset, i.e. create a new list with individual items like newlist = ["hello", "world", "this", ....] and then use a for loop and to sort them all into a new dictionary.
Is this idea correct? I've been doodling but things dont seem to be correct.
CodePudding user response:
For each element in the dataset you need to convert its content to lowercase then isolate the tokens (words). Then populate your dictionary with the "line" in which the token appears within the dataset. Something like this:
dataset = [
"Hello world",
"This is the WORLD",
"hello again"
]
def reverse_index(ds):
result = {}
for i, e in enumerate(ds):
for k in e.lower().split():
result.setdefault(k, []).append(i)
return result
res = reverse_index(dataset)
assert res == {
'hello': [0, 2],
'world': [0, 1],
'this': [1],
'is': [1],
'the': [1],
'again': [2]
}