Finding the index of a word and its distance

Time:08-24

Hello, I've been trying to write a script that finds the distance from one occurrence of the word "one" to the next one in a text, but the code doesn't quite work.

import re

#dummy text
words_list = ['one']
long_string = "are marked by one  the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"

striped_long_text = re.sub('  ', ' ',(long_string.replace('\n',' '))).strip()
length = []

index_items = {}

for item in words_list :
    text_split = striped_long_text.split('{}'.format(item))[:-1]

    for space in text_split:
        if space:
            length.append(space.count(' ')-1)

print(length)

Dummy text:

are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one

What I'm trying to do is find the exact distance, in words, from each occurrence of the word "one" to the next occurrence in the text. The code works, with one exception: if the first "one" appears a few words into the text, the words before it are counted from the start of the text as well, which adds a wrong first value to the output. If I add the word "one" at the very start of the text, the code works fine.

Output of the code vs what it should be :


result = [2, 9, 5, 13]
expected result = [9, 5, 13]

CodePudding user response:

Sorry, I can't ask questions in comments yet, but if you specifically want to use regular expressions, then this may help you:

import re

def forOneWord(word):
    length = []
    lastIndex = -1
    for i in range(len(words)):
        if words[i].lower() == word:
            if lastIndex != -1:
                length.append(i - lastIndex - 1)
            lastIndex = i
    return length

def forWordsList(words_list):
    for i in range(len(words_list)):  # Make all words lowercase
        words_list[i] = words_list[i].lower()
        
    length = [[] for i in range(len(words_list))]
    lastIndex = [-1 for i in range(len(words_list))]
    for i in range(len(words)):  # For all words in string
        for j in range(len(words_list)):  # For all words in list
            if words[i].lower() == words_list[j]:
                if lastIndex[j] != -1:  # This means that this is not the first match
                    length[j].append(i - lastIndex[j] - 1)
                lastIndex[j] = i
    return length
    

words_list = ['one' , 'the', 'a']
long_string = "are marked by one  the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"

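# Tokenize without requiring a trailing space, so that words followed by
# punctuation and the last word of the string are not dropped.
words = re.findall(r"\w[\w!-]*", long_string)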

print(forOneWord('one'))
print(forWordsList(words_list))

There are two functions: one for a single-word search and another for searching by a list of words. Also, if using RegEx is not required, I can improve the performance of this solution if you want.
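For comparison, the same index-based idea can be written more compactly without RegEx. This is only a sketch: the helper name gaps_between is made up for illustration, and it assumes that splitting on whitespace is an acceptable definition of a "word" (so a token like "one," with attached punctuation would not count as a match).

```python
def gaps_between(text, word):
    """Count the tokens between consecutive occurrences of `word`.

    Hypothetical helper, not part of the answer above. Tokens are
    whitespace-separated and compared case-insensitively.
    """
    tokens = text.split()
    target = word.lower()
    # Positions of every token equal to the target word.
    positions = [i for i, tok in enumerate(tokens) if tok.lower() == target]
    # Pair each position with the next one and count the tokens in between.
    return [nxt - cur - 1 for cur, nxt in zip(positions, positions[1:])]


long_string = "are marked by one  the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"

print(gaps_between(long_string, "one"))  # [9, 5, 13]
```

Because only the gaps between consecutive positions are computed, the words before the first occurrence and after the last one never enter the result.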

CodePudding user response:

Another solution, using itertools.groupby:

from itertools import groupby

words_list = ["one"]
long_string = "are marked by one  the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"

for w in words_list:
    out = [
        (v, sum(1 for _ in g))
        for v, g in groupby(long_string.split(), lambda k: k == w)
    ]

    # drop the runs of non-matching words before the first and after the last occurrence:
    if out and out[0][0] is False:
        out = out[1:]

    if out and out[-1][0] is False:
        out = out[:-1]

    out = [length for is_word, length in out if not is_word]

print(out)

Prints:

[9, 5, 13]
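To see why the first and last groups have to be trimmed, it can help to look at what groupby produces on a shorter text. The following is a small sketch with made-up helper names (runs, gaps) and a made-up sample sentence; it follows the same steps as the code above.

```python
from itertools import groupby


def runs(text, word):
    """Return (is_word, run_length) pairs for consecutive tokens.

    Hypothetical helper for illustration only.
    """
    return [
        (is_word, sum(1 for _ in grp))
        for is_word, grp in groupby(text.split(), lambda tok: tok == word)
    ]


def gaps(text, word):
    """Keep only the non-matching runs that lie between two occurrences."""
    r = runs(text, word)
    if r and not r[0][0]:   # words before the first occurrence
        r = r[1:]
    if r and not r[-1][0]:  # words after the last occurrence
        r = r[:-1]
    return [length for is_word, length in r if not is_word]


sample = "x one a b one c one y"
print(runs(sample, "one"))
# [(False, 1), (True, 1), (False, 2), (True, 1), (False, 1), (True, 1), (False, 1)]
print(gaps(sample, "one"))
# [2, 1]
```

One thing to keep in mind with this approach: adjacent occurrences such as "one one" are merged by groupby into a single (True, 2) run, so a distance of 0 between them is not reported.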