Mapping Each Word In A String's Start and End Index To Dictionary

I'm trying to find the index range (start index and end index) of each word in a string. Spaces are omitted, and indexing starts at 1 for human readability. I thought the best approach was a list of lists, where each nested list contains the word and a list of its start and end index. From a sample string, I get the following list:

text = "i have a list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

yields:

boundaries_list=[['i', [1, 1]], ['have', [3, 6]], ['a', [4, 4]], ['list', [10, 13]], ['of', [15, 16]], ['lists', [18, 22]], ['that', [24, 27]], ['contain', [29, 35]], ['a', [4, 4]], ['word', [39, 42]], ['and', [44, 46]], ['there', [48, 52]], ['indices', [54, 60]], ['my', [62, 63]], ['method', [65, 70]], ['works', [72, 76]], ['except', [78, 83]], ['with', [85, 88]], ['repeated', [90, 97]], ['words', [99, 103]], ['like', [105, 108]], ['of', [15, 16]], ['or', [40, 41]], ['a', [4, 4]], ['or', [40, 41]], ['the', [48, 50]], ['or', [40, 41]], ['it', [86, 87]]]

This works, but it's not very readable. It would be nice to compile it into a dictionary. Dictionaries work, except when the same key appears more than once. For me, that means only one occurrence of a repeated word ends up in the dictionary, so the index ranges of its other occurrences are lost.

To get around this, I tried using defaultdict on a list of dictionaries, but this only gave me the first occurrence's index range, repeated once for each occurrence of the word.

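For example:

from collections import defaultdict

# Turn each [word, [start, end]] pair into a single-entry dict
new_list = []
for one_d in boundaries_list:
    nested_list_to_nested_dict = {one_d[0]: one_d[1]}
    new_list.append(nested_list_to_nested_dict)

# Group the index ranges by word
res = defaultdict(list)
for d in new_list:
    for k, v in d.items():
        res[k].append(v)

print(res)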
>>> defaultdict(<class 'list'>, {'i': [[1, 1]], 'have': [[3, 6]], 'a': [[4, 4], [4, 4], [4, 4]], 'list': [[10, 13]], 'of': [[15, 16], [15, 16]], 'lists': [[18, 22]], 'that': [[24, 27]], 'contain': [[29, 35]], 'word': [[39, 42]], 'and': [[44, 46]], 'there': [[48, 52]], 'indices': [[54, 60]], 'my': [[62, 63]], 'method': [[65, 70]], 'works': [[72, 76]], 'except': [[78, 83]], 'with': [[85, 88]], 'repeated': [[90, 97]], 'words': [[99, 103]], 'like': [[105, 108]], 'or': [[40, 41], [40, 41], [40, 41]], 'the': [[48, 50]], 'it': [[86, 87]]})

Any help is much appreciated.

CodePudding user response：

You can use re.finditer, together with the start and end methods of the match objects:

import re
from collections import defaultdict

text = "i have a list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

output = defaultdict(list)
for m in re.finditer(r"\S+", text):
    output[m.group(0)].append((m.start(0) + 1, m.end(0)))

print(output)
# defaultdict(<class 'list'>, {'i': [(1, 1)], 'have': [(3, 6)], 'a': [(8, 8), (37, 37), (116, 116)], 'list': [(10, 13)], 'of': [(15, 16), (110, 111)], 'lists': [(18, 22)], 'that': [(24, 27)], 'contain': [(29, 35)], 'word': [(39, 42)], 'and': [(44, 46)], 'there': [(48, 52)], 'indices': [(54, 60)], 'my': [(62, 63)], 'method': [(65, 70)], 'works': [(72, 76)], 'except': [(78, 83)], 'with': [(85, 88)], 'repeated': [(90, 97)], 'words': [(99, 103)], 'like': [(105, 108)], 'or': [(113, 114), (118, 119), (125, 126)], 'the': [(121, 123)], 'it': [(128, 129)]})
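
The question does not show how boundaries_list was built, so this is only a guess: the repeated ranges look like what a lookup such as str.find would give, since it always returns the first match, even when that match sits inside another word. A minimal sketch of the difference, using a shortened version of the sample text (the names first_a, first_or and spans are just for illustration):

```python
import re
from collections import defaultdict

text = "i have a list of lists that contain a word and there indices"

# Assumed cause: str.find returns the FIRST match, even inside another word.
first_a = text.find("a") + 1    # hits the "a" inside "have"
first_or = text.find("or") + 1  # hits the "or" inside "word"
print(first_a, first_or)        # 4 40

# finditer walks the string once, so every occurrence gets its own position.
spans = defaultdict(list)
for m in re.finditer(r"\S+", text):
    spans[m.group(0)].append((m.start() + 1, m.end()))

print(spans["a"])               # [(8, 8), (37, 37)]
```

Those first-match values, 4 and 40, are the same start indices that show up for 'a' and 'or' in the question's list.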

CodePudding user response：

I added a double space (after the first "a") just for testing:

text = "i have a  list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

from collections import defaultdict
new_dict = defaultdict(list)
offset = 0
for word in text.split(" "):
    new_dict[word].append([offset, offset + len(word)])
    offset += len(word) + 1

new_dict

Output:

defaultdict(list,
            {'i': [[0, 1]],
             'have': [[2, 6]],
             'a': [[7, 8], [37, 38], [116, 117]],
             '': [[9, 9]],
             'list': [[10, 14]],
             'of': [[15, 17], [110, 112]],
             'lists': [[18, 23]],
             'that': [[24, 28]],
             'contain': [[29, 36]],
             'word': [[39, 43]],
             'and': [[44, 47]],
             'there': [[48, 53]],
             'indices': [[54, 61]],
             'my': [[62, 64]],
             'method': [[65, 71]],
             'works': [[72, 77]],
             'except': [[78, 84]],
             'with': [[85, 89]],
             'repeated': [[90, 98]],
             'words': [[99, 104]],
             'like': [[105, 109]],
             'or': [[113, 115], [118, 120], [125, 127]],
             'the': [[121, 124]],
             'it': [[128, 130]]})

The indices in the dict are 0-based and give exactly the start and end of the slice of the string, e.g. text[128:130] is equal to 'it'. Note that the double space shows up as an empty-string key.
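
If you want the 1-based, inclusive ranges from the question instead of slice indices, and no empty-string key for repeated spaces, the same loop can be adjusted slightly. This is a rough sketch rather than a definitive version; the function name word_spans is made up for illustration, and it assumes words are separated by plain spaces only:

```python
from collections import defaultdict

def word_spans(text):
    """Map each word to a list of 1-based, inclusive [start, end] ranges."""
    spans = defaultdict(list)
    offset = 0  # 0-based position of the current piece in text
    for word in text.split(" "):
        if word:  # skip the empty pieces produced by consecutive spaces
            spans[word].append([offset + 1, offset + len(word)])
        offset += len(word) + 1  # move past the piece and one space
    return spans

result = word_spans("a  list of a")
print(dict(result))
# {'a': [[1, 1], [12, 12]], 'list': [[4, 7]], 'of': [[9, 10]]}
```

The offset still advances for empty pieces, which is what keeps the positions correct after a run of spaces.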