I have this string:
myString = "Tomorrow will be very very rainy"
I would like to get the start index of the word number 5 (very).
What I do currently, I do split myString into words:
words = re.findall( r'\w |[^\s\w] ', myString)
But I am not sure on how to get the start index of the word number 5: words[5].
Using the index() is not working as it finds the first occurrence:
start_index = myString.index(words[5])
CodePudding user response:
Not very elegant, but loop through the list of split words and calculate the index based on the word length and the split character (in this case a space). This answer will target the fifth word in the sentence.
myString = "Tomorrow will be very very rainy"
target_word = 5
split_string = myString.split()
idx_start = 0
for i in range(target_word-1):
idx_start = len(split_string[i])
if myString[idx_start] == " ":
idx_start = 1
idx_end = idx_start len(split_string[target_word-1]) 1
print(idx_start, idx_end, myString[idx_start:idx_end])
CodePudding user response:
wordnum = 5
l = [x.span()[1] for x in re.finditer(" ", string)]
pos = l[wordnum-2]
print(pos)
output
22
CodePudding user response:
If only single spaces between words:
- Sum all word lengths before the wanted word
- Add amount of spaces
word_idx = 4 # zero based index
words = myString.split()
start_index = sum(len(word) for word in words[:word_idx]) word_idx
Result:
22
CodePudding user response:
If the string starts with 5 words, you can match the first 4 words and capture the fifth one.
The you can use the start
method and pass 1 to it for the first capture group of the Match Object.
^(?:\w \s ){4}(\w )
Explanation
^
Start of string(?:\w \s ){4}
Repeat 4 times matching 1 word characters and 1 whitspace chars(\w )
Capture group 1, match 1 word characters
Example
import re
myString = "Tomorrow will be very very rainy"
pattern = r"^(?:\w \s ){4}(\w )"
m = re.match(pattern, myString)
if m:
print(m.start(1))
Output
22
For a broader match you can use \S
to match one or more non whitespace characters.
pattern = r"^(?:\S \s ){4}(\S )"