Finding similar words in text with regex and getting index of them

Time:10-30

Here is the data I have.

t = 'Billy and Willy and Billy and someone'
words = ['Billy', 'Willy', 'Billy']

I plan to find the words in order. First I find Billy, then I cut the string off at the end of that match.

For example:

new_t = ' and Willy and Billy and someone'

Then I look for Willy in what remains, and so on.

Here is what I have written so far:

import re

t = 'Billy and Willy and Billy and someone'
words = ['Billy', 'Willy', 'Billy']
indexes = []
j = 0
for i in words:
    l = re.search(i, t[j:]).span()
    indexes.append(l)
    j = l[1]

I know this is wrong. Can you help me get a result like this:

Billy = (0,5)
Willy = (10,15)
Billy = (20,25)

CodePudding user response:

To find exact substrings, you don't need re. You can instead use str.index:

t = 'Billy and Willy and Billy and someone'
words = ['Billy', 'Willy', 'Billy']

indexes = []
current_pos = 0
for word in words:
    ind = t.index(word, current_pos)
    indexes.append((ind, ind + len(word)))
    current_pos = ind + 1

print(indexes) # [(0, 5), (10, 15), (20, 25)]
for w, i in zip(words, indexes):
    print(w, '=', i)
# Billy = (0, 5)
# Willy = (10, 15)
# Billy = (20, 25)

The second parameter of index is the starting position of the search, so you only need to update the starting position (current_pos) once a search is done.
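One caveat: str.index raises ValueError when a word does not occur at or after the starting position. Below is a minimal sketch of a variant that stops at the first missing word instead, using str.find (which returns -1 when nothing is found). The helper name spans_in_order is made up for illustration, and it resumes the search at the end of each match rather than one character after its start, which is an assumption about what you want for overlapping words.

```python
def spans_in_order(text, words):
    """Return (start, end) spans of words found left to right in text."""
    spans = []
    pos = 0
    for word in words:
        start = text.find(word, pos)  # -1 if word is absent from pos onward
        if start == -1:
            break  # stop at the first word that cannot be found
        end = start + len(word)
        spans.append((start, end))
        pos = end  # resume after this match (assumption: no overlaps wanted)
    return spans

t = 'Billy and Willy and Billy and someone'
print(spans_in_order(t, ['Billy', 'Willy', 'Billy']))  # [(0, 5), (10, 15), (20, 25)]
print(spans_in_order(t, ['Billy', 'Bob', 'Willy']))    # [(0, 5)]
```

Whether to stop, skip the word, or raise is a design choice; stopping keeps the spans aligned with the start of the word list.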

Or, with the walrus operator (Python 3.8+), you can shorten the loop to:

b = 0
indexes = [(a := t.index(w, b), b := a + len(w)) for w in words]

CodePudding user response:

Using re:

import re

t = 'Billy and Willy and Billy and someone'
words = 'Billy', 'Willy'

for match in re.finditer('|'.join(words), t):
    print(f"{match[0]} = {match.span()}")

Billy = (0, 5)
Willy = (10, 15)
Billy = (20, 25)
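Two things to keep in mind with the alternation pattern: it reports every occurrence of any listed word in text order, not in the order of the list, and the words are not escaped, so a word containing '.' or '+' would be read as regex syntax. If you want to keep the question's one-word-at-a-time loop, the problem there is that the span from re.search(i, t[j:]) is relative to the slice, not to t. Here is a minimal sketch of one way around that, assuming literal words: compile each word and pass the start position to Pattern.search, which reports positions in the full string. The function name find_in_order is hypothetical.

```python
import re

def find_in_order(text, words):
    """Return the (start, end) span of each word, searching left to right."""
    spans = []
    pos = 0
    for word in words:
        # re.escape keeps characters such as '.' or '+' literal
        pattern = re.compile(re.escape(word))
        # The pos argument starts the search there but keeps absolute indexes
        match = pattern.search(text, pos)
        if match is None:
            raise ValueError(f"{word!r} not found at or after position {pos}")
        spans.append(match.span())
        pos = match.end()  # continue after the end of this match
    return spans

t = 'Billy and Willy and Billy and someone'
words = ['Billy', 'Willy', 'Billy']
for word, span in zip(words, find_in_order(t, words)):
    print(f"{word} = {span}")
# Billy = (0, 5)
# Willy = (10, 15)
# Billy = (20, 25)
```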