Hello, I've been trying to write a script that finds the distance (in words) from each occurrence of the word "one" to the next "one" in a text, but the code doesn't quite work:
import re
#dummy text
words_list = ['one']
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
striped_long_text = re.sub(' +', ' ', long_string.replace('\n', ' ')).strip()
length = []
index_items = {}
for item in words_list:
    text_split = striped_long_text.split('{}'.format(item))[:-1]
    for space in text_split:
        if space:
            length.append(space.count(' ') - 1)
print(length)
Dummy text:
are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one
What I'm trying to do is find the exact distance from each occurrence of the word "one" to the next "one" in the text. As you can see, this code works well, with one exception: if the first "one" only appears after 2-3 words, the counting starts from the beginning of the text, and that adds a wrong value to the output. If I add the word "one" at the start of the text, for example, the code works fine.
Output of the code vs. what it should be:
result = [2, 9, 5, 13]
expected result = [9, 5, 13]
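For reference, the extra 2 seems to come from the first segment that split returns, which is the text before the first "one" ("are marked by "). A minimal sketch that keeps the split approach but drops that leading segment could look like the following. The helper name gaps_via_split is hypothetical, and because split matches substrings, this would also split inside words such as "money".

```python
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"


def gaps_via_split(text, word):
    """Count the whitespace-separated tokens between consecutive occurrences of `word`.

    Hypothetical helper: it keeps the str.split idea from the question, but
    ignores the segment before the first occurrence and the one after the last.
    """
    segments = text.split(word)
    # segments[0] is the text before the first occurrence and segments[-1]
    # is the text after the last one; neither lies between two occurrences.
    inner = segments[1:-1]
    return [len(segment.split()) for segment in inner]


print(gaps_via_split(long_string, "one"))  # [9, 5, 13]
```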
CodePudding user response:
Sorry, I can't ask questions in comments yet, but if you specifically want to use regular expressions, then this may help:
import re
def forOneWord(word):
    length = []
    lastIndex = -1
    for i in range(len(words)):
        if words[i].lower() == word:
            if lastIndex != -1:
                length.append(i - lastIndex - 1)
            lastIndex = i
    return length

def forWordsList(words_list):
    for i in range(len(words_list)):  # Make all words lowercase
        words_list[i] = words_list[i].lower()
    length = [[] for i in range(len(words_list))]
    lastIndex = [-1 for i in range(len(words_list))]
    for i in range(len(words)):  # For all words in string
        for j in range(len(words_list)):  # For all words in list
            if words[i].lower() == words_list[j]:
                if lastIndex[j] != -1:  # This means that this is not the first match
                    length[j].append(i - lastIndex[j] - 1)
                lastIndex[j] = i
    return length

words_list = ['one' , 'the', 'a']
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
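# Match runs of non-whitespace so that tokens followed by punctuation
# (e.g. "them,") and the final word of the string are not skipped
words = re.findall(r"\S+", long_string)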
print(forOneWord('one'))
print(forWordsList(words_list))
There are two functions: one for a single-word search and another that searches for every word in a list. If using RegEx is not a requirement, I can also improve the performance of this solution if you'd like.
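As a rough idea of what a version without RegEx might look like, here is a minimal sketch that walks over the whitespace-separated tokens once and uses dictionary lookups instead of the inner loop over words_list. The function name word_gaps is hypothetical, and tokens keep their punctuation, so "them," would not match "them".

```python
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"


def word_gaps(text, targets):
    """Return {word: [gap, ...]}, where each gap is the number of tokens
    between two consecutive occurrences of that word (case-insensitive)."""
    wanted = {t.lower() for t in targets}
    last_seen = {}                      # word -> index of its latest occurrence
    gaps = {t: [] for t in wanted}
    for i, token in enumerate(text.split()):
        key = token.lower()
        if key in wanted:
            if key in last_seen:        # not the first match for this word
                gaps[key].append(i - last_seen[key] - 1)
            last_seen[key] = i
    return gaps


result = word_gaps(long_string, ["one", "the", "a"])
print(result["one"])  # [9, 5, 13]
print(result["the"])  # [5, 10]
print(result["a"])    # [2]
```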
CodePudding user response:
Another solution, using itertools.groupby:
from itertools import groupby
words_list = ["one"]
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"
for w in words_list:
    out = [
        (v, sum(1 for _ in g))
        for v, g in groupby(long_string.split(), lambda k: k == w)
    ]

    # check first and last occurrence:
    if out and out[0][0] is False:
        out = out[1:]

    if out and out[-1][0] is False:
        out = out[:-1]

    out = [length for is_word, length in out if not is_word]
    print(out)
Prints:
[9, 5, 13]
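One thing to keep in mind with groupby is that consecutive occurrences of the word end up in a single group, so a distance of zero between two adjacent matches is not reported. If that case matters, a possible variant is to collect the token positions first and subtract neighbouring pairs. This is only a sketch, and the name gaps_from_positions is made up for the example.

```python
long_string = "are marked by one the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there one"


def gaps_from_positions(text, word):
    """Return the number of tokens between each pair of consecutive
    occurrences of `word` among the whitespace-separated tokens of `text`."""
    positions = [i for i, token in enumerate(text.split()) if token == word]
    # Pair every position with the following one; the gap is the number of
    # tokens strictly between the two.
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


print(gaps_from_positions(long_string, "one"))        # [9, 5, 13]
print(gaps_from_positions("one one two one", "one"))  # [0, 1]
```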