Python: find the start index of a specific word number in a string-CodePudding

I have this string:

myString = "Tomorrow will be very very rainy"

I would like to get the start index of the word number 5 (very).

Currently, I split myString into words:

words = re.findall(r'\w+|[^\s\w]+', myString)

But I am not sure how to get the start index of word number 5, which is words[4] since list indices are zero-based.

Using index() does not work because it finds the first occurrence:

start_index = myString.index(words[4])

CodePudding user response：

Not very elegant, but loop through the list of split words and calculate the index based on the word length and the split character (in this case a space). This answer will target the fifth word in the sentence.

myString = "Tomorrow will be very very rainy"

target_word = 5

split_string = myString.split()

idx_start = 0

for i in range(target_word-1):
    # Skip past each preceding word...
    idx_start += len(split_string[i])
    # ...and past the single space that follows it
    if myString[idx_start] == " ":
        idx_start += 1

idx_end = idx_start + len(split_string[target_word-1])  # exclusive end, so the slice holds only the word

print(idx_start, idx_end, myString[idx_start:idx_end])
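The manual bookkeeping above can also be delegated to re.finditer, since each match object reports its own offset and no assumption about the separator is needed. This is only a minimal sketch: the helper name word_start is hypothetical (it is not part of the answer above), and a "word" is assumed to be any run of non-whitespace characters.

```python
import re

def word_start(text, n):
    """Return the start offset of the n-th (1-based) whitespace-separated word."""
    # finditer yields match objects lazily; each one knows where it begins
    for count, match in enumerate(re.finditer(r"\S+", text), start=1):
        if count == n:
            return match.start()
    raise IndexError(f"text has fewer than {n} words")

myString = "Tomorrow will be very very rainy"
print(word_start(myString, 5))  # prints 22
```

Because the offsets come from the matches themselves, this also works when words are separated by tabs or several spaces.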

CodePudding user response：

import re

wordnum = 5
# The end of each run of spaces is the start of the word that follows it
l = [x.span()[1] for x in re.finditer(" +", myString)]
pos = l[wordnum-2]
print(pos)

Output:

22
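One caveat with this list lookup: for wordnum = 1 the index wordnum-2 becomes -1, which silently selects the last run of spaces instead of the start of the string. A possible guard is sketched below, assuming the text has no leading spaces; the function name start_after_spaces is hypothetical.

```python
import re

def start_after_spaces(text, wordnum):
    """Start offset of the wordnum-th (1-based) word, for space-separated text."""
    # Assumption: no leading spaces, so the first word starts at offset 0
    if wordnum == 1:
        return 0
    # End offset of every run of spaces = start offset of the following word
    ends = [m.end() for m in re.finditer(" +", text)]
    return ends[wordnum - 2]

myString = "Tomorrow will be very very rainy"
print(start_after_spaces(myString, 1))  # prints 0
print(start_after_spaces(myString, 5))  # prints 22
```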

CodePudding user response：

If there are only single spaces between words:

Sum the lengths of all words before the wanted word
Add the number of spaces, which equals the number of preceding words

word_idx = 4  # zero based index
words = myString.split()
start_index = sum(len(word) for word in words[:word_idx]) + word_idx

Result:

22
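To illustrate why the single-space condition matters, here is a small sketch that runs the same arithmetic on a string containing a double space. The function name start_by_lengths and the second test string are made up for the illustration.

```python
def start_by_lengths(text, word_idx):
    """Zero-based word_idx; assumes exactly one space between words."""
    words = text.split()
    # Lengths of the preceding words plus one space per preceding word
    return sum(len(w) for w in words[:word_idx]) + word_idx

single = "Tomorrow will be very very rainy"
double = "Tomorrow  will be very very rainy"  # two spaces after "Tomorrow"

print(start_by_lengths(single, 4))  # prints 22
print(start_by_lengths(double, 4))  # prints 22, but the word really starts at 23
print(repr(double[22:26]))          # prints ' ver'
```

split() discards how much whitespace there was, so the computed offset drifts by one for every extra space before the target word.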

CodePudding user response：

If the string starts with 5 words, you can match the first 4 words and capture the fifth one.

Then you can use the start method of the Match object and pass 1 to it for the first capture group.

^(?:\w+\s+){4}(\w+)

Explanation

^ Start of string
(?:\w+\s+){4} Repeat 4 times: match 1+ word characters followed by 1+ whitespace characters
(\w+) Capture group 1, match 1+ word characters

Example

import re

myString = "Tomorrow will be very very rainy"
pattern = r"^(?:\w+\s+){4}(\w+)"
m = re.match(pattern, myString)
if m:
    print(m.start(1))

Output

22

For a broader match you can use \S+ to match one or more non-whitespace characters.

pattern = r"^(?:\S+\s+){4}(\S+)"
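The repeat count in the pattern can be derived from the word number rather than hard-coded. The following is a sketch only: nth_word_span is a hypothetical helper name, and it assumes the string does not begin with whitespace.

```python
import re

def nth_word_span(text, n):
    """Return (start, end) of the n-th (1-based) word, or None if there is none."""
    # Skip n-1 "word + whitespace" pairs, then capture the n-th word
    pattern = r"^(?:\S+\s+){%d}(\S+)" % (n - 1)
    m = re.match(pattern, text)
    return m.span(1) if m else None

myString = "Tomorrow will be very very rainy"
print(nth_word_span(myString, 5))  # prints (22, 26)
print(nth_word_span(myString, 7))  # prints None
```

Returning the span instead of only the start makes it easy to slice the word back out with myString[start:end].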