How do I write a program that takes a list of strings as input and returns a dictionary?

Rules: In the dictionary, each key will be a word k, while the value will be a list of indices of the input strings where the word k appears.

Words should be treated as lowercase only, i.e. Hello and hello should be treated the same.

It can be assumed that the dataset will be a list containing only strings. There is no need to check the type of the elements in the dataset.

The string data in the dataset will be clean. There is no need to worry about cleaning, e.g. removing punctuation marks or numbers.

In the example below, the function determines what the indices of the words in the given dataset are. dataset is the list containing the strings.

The reverse_index function is supposed to create and return the dictionary.


dataset = [
    "Hello world",
    "This is the WORLD",
    "hello again"
 ]
res = reverse_index(dataset)

# This assertion checks if the result equals the expected dictionary
assert(res == {
    'hello': [0, 2],
    'world': [0, 1],
    'this': [1],
    'is': [1],
    'the': [1],
    'again':[2]
  })

I'm not really sure what to do next, but this is how I started:

dataset = [
    "Hello world",
    "This is the WORLD",
    "hello again"
 ] 

def reverse_index(dataset):

CodePudding user response：

You can use collections.defaultdict as a basis and a small loop:

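from collections import defaultdict

res = defaultdict(list)  # a missing key starts out as an empty list
for i, s in enumerate(dataset):
    # set() records each index once per word, even if the word repeats in a string
    for w in set(map(str.lower, s.split())):
        res[w].append(i)
dict(res)  # convert back to a plain dict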

output:

{'hello': [0, 2],
 'world': [0, 1],
 'is': [1],
 'the': [1],
 'this': [1],
 'again': [2]}
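
For completeness, the loop above can be wrapped in the reverse_index function the question asks for. This is only a minimal sketch, assuming the same set-based deduplication. It is checked against the dataset and expected dictionary from the question:

```python
from collections import defaultdict


def reverse_index(dataset):
    index = defaultdict(list)
    for i, s in enumerate(dataset):
        # lowercase the whole string once, then split on whitespace;
        # set() keeps each word only once per string
        for w in set(s.lower().split()):
            index[w].append(i)
    # return a plain dict so the result compares and prints like one
    return dict(index)


dataset = [
    "Hello world",
    "This is the WORLD",
    "hello again",
]
res = reverse_index(dataset)

# dict equality ignores key order, so the arbitrary iteration order of set() does not matter here
assert res == {
    'hello': [0, 2],
    'world': [0, 1],
    'this': [1],
    'is': [1],
    'the': [1],
    'again': [2],
}
```

The index lists stay sorted because the strings are visited in order. Only the order of the keys can vary between runs.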

CodePudding user response：

You can try this method:

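def reverse_index(data):
    res = dict()
    # enumerate yields each string together with its index
    for i, line in enumerate(data):
        for word in map(str.lower, line.split()):
            # note: a word repeated within one string adds its index once per occurrence
            if word not in res:
                res[word] = [i]
            else:
                res[word].append(i)
    return res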

output:

{
    'hello': [0, 2],
    'world': [0, 1],
    'this': [1],
    'is': [1],
    'the': [1],
    'again':[2]
}
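
One thing to watch with this loop: if a word occurs twice in the same string, its index is appended twice. The sample dataset has no such case, so the output above is unaffected. If duplicates are not wanted, a possible variant is sketched below. The name reverse_index_unique is made up for this example, and the check relies on the indices being visited in increasing order:

```python
def reverse_index_unique(data):
    res = {}
    for i, line in enumerate(data):
        for word in line.lower().split():
            # setdefault returns the existing list, or stores and returns a new empty one
            indices = res.setdefault(word, [])
            # indices are appended in increasing order, so a repeat of i
            # can only be the last entry
            if not indices or indices[-1] != i:
                indices.append(i)
    return res


print(reverse_index_unique(["Hello hello world", "hello again"]))
# → {'hello': [0, 1], 'world': [0], 'again': [1]}
```

Unlike the set-based approach, this keeps the keys in the order the words are first seen.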