I am given a string and a character position in that string. I want to get the n words before that position, without including the last word if the character position falls in the middle of a word.
text = 'the house is big the house is big the house is big'
char_nr = 19
list_of_words_before = text[:char_nr-1].split()
print(list_of_words_before)  # the string is split inside "the", so the list ends with a 't' that I don't want
nr_words = 3
if nr_words > len(list_of_words_before):
    nr_words = len(list_of_words_before)
print(list_of_words_before[-nr_words:])
This gives:
['the', 'house', 'is', 'big', 't']
['is', 'big', 't']
But what I actually want is ['house', 'is', 'big'], since the t is only part of a word.
How can I make sure, in the first place, that the text is only divided at a space between words? Is there any other solution?
CodePudding user response:
Using regex:
>>> import re
>>> text = 'the house is big the house is big the house is big'
>>> result = re.match(r".{0,18}\b", text).group(0).split()
>>> result
['the', 'house', 'is', 'big']
>>> result[-3:]
['house', 'is', 'big']
Explanation:
. matches any character.
{0,18} matches the preceding token (.) 0 to 18 times, as many as possible.
\b makes the match end at the beginning or end of a word, so we don't get partial words.
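To avoid hard-coding 18, the same idea can be wrapped in a small function. This is only a sketch: the name words_before and its parameters are made up for illustration, and it assumes char_nr is 1-based as in the question. Note that \b also treats an apostrophe or hyphen as a word edge, so "don't" could still be cut after "don".

```python
import re

def words_before(text, char_nr, nr_words):
    """Return up to nr_words whole words found before char_nr (hypothetical helper)."""
    # Same slice length as text[:char_nr - 1] in the question, never negative.
    limit = max(char_nr - 1, 0)
    # Match at most `limit` characters, then back off until a word boundary is reached.
    # re.DOTALL lets "." also count newline characters.
    match = re.match(rf".{{0,{limit}}}\b", text, flags=re.DOTALL)
    words = match.group(0).split() if match else []
    # words[-0:] would return the whole list, so handle nr_words <= 0 explicitly.
    return words[-nr_words:] if nr_words > 0 else []

text = 'the house is big the house is big the house is big'
print(words_before(text, 19, 3))  # ['house', 'is', 'big']
```

If nr_words is larger than the number of words found, the slice simply returns all of them, so no extra length check is needed.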
CodePudding user response:
Maybe something like this:
text = 'the house is big the house is big the house is big'
char_nr = 19
list_of_words_before = text[:char_nr - 1]
splitted = list_of_words_before.split()
if list_of_words_before[-1] != ' ':
    splitted = splitted[:-1]
nr_words = 3
print(splitted[-nr_words:])
Output:
['house', 'is', 'big']
CodePudding user response:
You can check the character right after the slice, text[char_nr - 1]. If it's a non-word character, the splitting was correct; otherwise you need to remove the last item from the list. Assuming that " " is the only character that can occur between words:
if text[char_nr - 1] != " ":
    list_of_words_before = list_of_words_before[:-1]
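Put together, that check could look like the sketch below. The function name words_before_position is hypothetical, char_nr is assumed to be 1-based as in the question, and any whitespace (not only " ") is treated as a separator. It also guards against a position past the end of the string, and it keeps the last word when the slice ends exactly at the end of a word.

```python
def words_before_position(text, char_nr, nr_words):
    """Return up to nr_words whole words before char_nr (hypothetical helper)."""
    cut = max(char_nr - 1, 0)      # same slice end as text[:char_nr - 1]
    words = text[:cut].split()
    # The last word is truncated only if the cut falls between two
    # non-whitespace characters that both exist in the string.
    cut_mid_word = (
        0 < cut < len(text)
        and not text[cut - 1].isspace()
        and not text[cut].isspace()
    )
    if cut_mid_word and words:
        words = words[:-1]
    # words[-0:] would return the whole list, so handle nr_words <= 0 explicitly.
    return words[-nr_words:] if nr_words > 0 else []

text_1 = 'the house is big the house is big the house is big'
text_2 = 'the house is big a house is big the house is big'
print(words_before_position(text_1, 19, 3))  # ['house', 'is', 'big']
print(words_before_position(text_2, 19, 3))  # ['is', 'big', 'a']
```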
CodePudding user response:
I think this is what you're looking for:
def get_n_words(text, char_nr, nr_words):
    if text[char_nr-1] == " ":
        list_of_words_before = text[:char_nr-1].split()
    else:
        list_of_words_before = text[:char_nr-1].split()[:-1]
    print(list_of_words_before)
    if nr_words > len(list_of_words_before):
        nr_words = len(list_of_words_before)
    print(list_of_words_before[-nr_words:])
text_1 = 'the house is big the house is big the house is big'
text_2 = 'the house is big a house is big the house is big'
print("Last word truncated:")
get_n_words(text_1, 19, 3)
print("\nLast word not truncated:")
get_n_words(text_2, 19, 3)
That has the following output:
Last word truncated:
['the', 'house', 'is', 'big']
['house', 'is', 'big']
Last word not truncated:
['the', 'house', 'is', 'big', 'a']
['is', 'big', 'a']
CodePudding user response:
You might use a pattern that starts the match with a non-whitespace character using \S, then matches any character 0 to 18 times using .{0,18}, while asserting that no non-whitespace character follows, using a negative lookahead (?!\S):
\S.{0,18}(?!\S)
import re
text = 'the house is big the house is big the house is big'
char_nr = 19
pattern = rf"\S.{{0,{char_nr - 1}}}(?!\S)"
strings = re.findall(pattern, text)
print(strings)
list_of_words_before = strings[0].split()  # the first match is the one that starts at the beginning of the text
print(list_of_words_before)
nr_words = 3
lenOfWordsBefore = len(list_of_words_before)
if nr_words > lenOfWordsBefore:
    nr_words = lenOfWordsBefore
print(list_of_words_before[-nr_words:])
Output
['the house is big', 'the house is big', 'the house is big']
['the', 'house', 'is', 'big']
['house', 'is', 'big']