I have input text of variable length. For example,
text = "person alpha:\nhi! how are you doing?\n\nperson beta:\nI am fine, thank you. What are you doing?\n\nperson alpha:\nI am at home watching tv.\n\nperson beta:\nThat sounds like a lot of fun. What are you watching?\n\nperson alpha:\nI'm watching a new TV series."
Now I would like to restrict the length of the text, for example to the last 100 characters. An easy way to do this is:
if len(text) > 100:
    text = text[len(text)-100:]
The problem with this approach is that the string is cut at an arbitrary position. I would like it to be cut only at person alpha or person beta, so that the text always starts with person alpha:\n... or person beta:\n.... Of course, the resulting text can then be shorter than 100 characters.
How can this be done efficiently?
Edit: Here is the original text:
person alpha: hi! how are you doing?
person beta: I am fine, thank you. What are you doing?
person alpha: I am at home watching tv.
person beta: That sounds like a lot of fun. What are you watching?
person alpha: I'm watching a new TV series.
When selecting the last 100 characters, the result should look like this:
person alpha: I'm watching a new TV series.
When selecting the last 470 characters, the result should look like this:
person beta: I am fine, thank you. What are you doing?
person alpha: I am at home watching tv.
person beta: That sounds like a lot of fun. What are you watching?
person alpha: I'm watching a new TV series.
CodePudding user response:
EDIT. To select from the end:
def get_split_point(text, n=100):
    """
    Return the index where to split text from the end:
    * the index is negative,
    * it is the position of the letter "p" of whichever
      "person ..." marker occurs first in the last n characters,
    * the chosen text from the end is always <= n characters,
    * -n is returned if NO persons are found in text_end.
    """
    text_end = text[-n:]
    m = len(text_end)  # smaller than n when the text is shorter than n
    alpha_index = text_end.find('person alpha:\n')
    beta_index = text_end.find('person beta:\n')
    if alpha_index == -1 and beta_index == -1:
        return -n  # both NOT found
    if alpha_index == -1:
        return beta_index - m  # alpha NOT found
    if beta_index == -1:
        return alpha_index - m  # beta NOT found
    return min(alpha_index, beta_index) - m  # both FOUND

def get_text_end(text, n=100):
    i = get_split_point(text, n)
    return text[i:]
Results:
>>> print(get_text_end(text, 5))
ries.
>>> print(get_text_end(text, 100))
person alpha:
I'm watching a new TV series.
>>> print(get_text_end(text, 150))
person beta:
That sounds like a lot of fun. What are you watching?
person alpha:
I'm watching a new TV series.
>>> print(get_text_end(text, 400))
person alpha:
hi! how are you doing?
person beta:
I am fine, thank you. What are you doing?
person alpha:
I am at home watching tv.
person beta:
That sounds like a lot of fun. What are you watching?
person alpha:
I'm watching a new TV series.
To count how many times a person occurs in the result:
get_text_end(text, 400).count('person alpha:\n') # 3
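If more than two speakers can occur, the two find calls could be replaced by a single regular-expression search. This is only a minimal sketch, assuming every marker has the form "person <name>:" followed by a newline; the name get_text_end_any and the MARKER pattern are illustrative assumptions, not part of the solution above:

```python
import re

# Assumed marker format: "person <name>:" followed by a newline.
MARKER = re.compile(r"person \w+:\n")

def get_text_end_any(text, n=100):
    """Return at most the last n characters, starting at a marker if one is found."""
    if n <= 0:
        return ""
    text_end = text[-n:]
    match = MARKER.search(text_end)  # first marker inside the window
    if match is None:
        return text_end  # no marker in the window: fall back to a plain cut
    return text_end[match.start():]

# Hypothetical three-speaker example
sample = "person alpha:\nhi!\n\nperson beta:\nfine.\n\nperson gamma:\nok."
print(get_text_end_any(sample, 30))
```

Like the find-based version, this would also match a message body that happens to contain "person x:" followed by a newline, so it relies on the same assumption about the input.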
Original solution. To select from the start:
alpha_index is the index of the LAST 'person alpha:\n' in the first 114 characters (because len('person alpha:\n') == 14). It is the index where the first letter 'p' of the string 'person alpha:\n' occurs. Extending the slice by the length of the marker means that a marker starting at index 100 or earlier is still found whole.
beta_index is the index of the LAST 'person beta:\n' in the first 113 characters (because len('person beta:\n') == 13):
selected_alpha = text[:114]
alpha_index = selected_alpha.rfind('person alpha:\n')
selected_beta = text[:113]
beta_index = selected_beta.rfind('person beta:\n')
i = max(alpha_index, beta_index)
if i == -1:
    i = 100  # if NO persons found in selected_text
Then you can split your text into 2 parts:
chosen = text[:i]
rest = text[i:]
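The steps for selecting from the start could also be wrapped in a function, so that the limit of 100 and the slice lengths 114 and 113 are not hard-coded. A minimal sketch; the name split_at_marker and the markers parameter are assumptions added for illustration:

```python
def split_at_marker(text, n=100, markers=('person alpha:\n', 'person beta:\n')):
    """Split text into (chosen, rest) with len(chosen) <= n, cutting at a marker if possible."""
    # Extend each search window by the marker length, so that a marker
    # starting at index n or earlier is still found whole.
    i = max(text[:n + len(marker)].rfind(marker) for marker in markers)
    if i == -1:
        i = n  # no marker found: fall back to a plain cut
    return text[:i], text[i:]

# Hypothetical usage with a short conversation
sample = "person alpha:\nhi!\n\nperson beta:\nfine."
chosen, rest = split_at_marker(sample, 25)
print(repr(chosen))
print(repr(rest))
```

As in the step-by-step version, chosen is empty when the only marker in range is the one at index 0.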