Splitting up a string to a constrained length

Time:08-15

I have input text of variable length. For example,

text = "person alpha:\nhi! how are you doing?\n\nperson beta:\nI am fine, thank you. What are you doing?\n\nperson alpha:\nI am at home watching tv.\n\nperson beta:\nThat sounds like a lot of fun. What are you watching?\n\nperson alpha:\nI'm watching a new TV series."

Now, I would like to restrict the length of the text, for example to 100 characters. An easy way to do this is

if len(text) > 100:
    text = text[-100:]

The problem with this approach is that the string is cut at an arbitrary position. I would like it to be cut only at person alpha or person beta, so that the text always starts with person alpha:\n.... or person beta:\n.... Of course, the resulting text can then be shorter than 100 characters.

How can this be done efficiently?

Edit: Here is the original text:

person alpha: hi! how are you doing?

person beta: I am fine, thank you. What are you doing?

person alpha: I am at home watching tv.

person beta: That sounds like a lot of fun. What are you watching?

person alpha: I'm watching a new TV series.

When selecting the last 100 characters, the result should look like this:

person alpha: I'm watching a new TV series.

When selecting the last 470 characters, the result should look like this:

person beta: I am fine, thank you. What are you doing?

person alpha: I am at home watching tv.

person beta: That sounds like a lot of fun. What are you watching?

person alpha: I'm watching a new TV series.

CodePudding user response:

EDIT. To select from the end:

def get_split_point(text, n=100):
    """
    Return the index where to split text from the end:
        * the index is negative,
        * it is the position of the first letter "p" of the
          "person..." marker which occurs first in the last n characters,
        * the chosen text from the end is always <= n characters,
        * -n is returned, if NO persons found in text_end.
    """
    text_end = text[-n:]
    # text_end is shorter than n when n > len(text), so measure
    # offsets against its actual length rather than against n
    size = len(text_end)

    alpha_index = text_end.find('person alpha:\n')
    beta_index = text_end.find('person beta:\n')
    if alpha_index == -1 and beta_index == -1:
        return -n  # both NOT found
    if alpha_index == -1:
        return beta_index - size  # alpha NOT found
    if beta_index == -1:
        return alpha_index - size  # beta NOT found
    return min(alpha_index, beta_index) - size  # both FOUND


def get_text_end(text, n=100):
    i = get_split_point(text, n)
    return text[i:]

Results:

>>> print(get_text_end(text, 5))
ries.
>>> print(get_text_end(text, 100))
person alpha:
I'm watching a new TV series.
>>> print(get_text_end(text, 150))
person beta:
That sounds like a lot of fun. What are you watching?

person alpha:
I'm watching a new TV series.
>>> print(get_text_end(text, 400))
person alpha:
hi! how are you doing?

person beta:
I am fine, thank you. What are you doing?

person alpha:
I am at home watching tv.

person beta:
That sounds like a lot of fun. What are you watching?

person alpha:
I'm watching a new TV series.
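
The same idea can probably be generalized to any number of speakers. The following is only a minimal sketch, assuming every turn begins at the start of a line with "person <name>:" followed by a newline. The name tail_from_marker and the MARKER pattern are hypothetical and not part of the answer above.

```python
import re

# Assumed marker format: a line starting with "person <name>:" and a newline.
MARKER = re.compile(r'^person [^:\n]+:\n', re.MULTILINE)


def tail_from_marker(text, n=100):
    """Return at most the last n characters of text, starting at a marker.

    Falls back to the plain last n characters when no marker starts
    inside that window.
    """
    if n <= 0:
        return ''
    start = max(len(text) - n, 0)
    # With a start position, '^' still only matches at real line starts.
    match = MARKER.search(text, start)
    if match is None:
        return text[start:]
    return text[match.start():]
```

Unlike the find-based version, this only accepts markers that begin a line, so a message body that happens to contain "person alpha:" in the middle of a sentence is not treated as a split point.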

To count how many times a person appears in the selected text:

get_text_end(text, 400).count('person alpha:\n')  # 3
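
If the turns of every speaker should be counted at once, a sketch along these lines may help. It assumes the same "person <name>:" line format as the sample text; count_speakers is a hypothetical helper name, not something from the answer.

```python
import re
from collections import Counter


def count_speakers(text):
    """Count the turns per speaker, keyed by the name after 'person '."""
    # Assumed format: each turn starts a line with "person <name>:" + newline.
    names = re.findall(r'^person ([^:\n]+):\n', text, flags=re.MULTILINE)
    return Counter(names)
```

For the sample text this yields 3 turns for alpha and 2 for beta.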

Original solution. To select from the start:

  • alpha_index is the index of the LAST 'person alpha:\n' in the first 114 characters (because len('person alpha:\n') == 14).

    It is the index of the first letter, 'p', of that occurrence of 'person alpha:\n'.

  • beta_index is the index of the LAST 'person beta:\n' in the first 113 characters (because len('person beta:\n') == 13):

selected_alpha = text[:114]
alpha_index = selected_alpha.rfind('person alpha:\n')

selected_beta = text[:113]
beta_index = selected_beta.rfind('person beta:\n')

i = max(alpha_index, beta_index)
if i == -1:
    i = 100  # if NO persons found in selected_text

Then you can split your text into 2 parts:

chosen = text[:i]
rest = text[i:]
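
The steps above can be wrapped into one function that works for any limit n. This is a sketch under the same assumptions as the snippet (only the two markers 'person alpha:\n' and 'person beta:\n' exist); the names MARKERS and split_at_start are hypothetical.

```python
MARKERS = ('person alpha:\n', 'person beta:\n')


def split_at_start(text, n=100):
    """Split text into (chosen, rest) with len(chosen) <= n.

    The cut is made at the last marker starting at an index <= n,
    or at index n when no such marker exists.
    """
    # rfind(m, 0, n + len(m)) searches text[:n + len(m)], so a marker
    # that starts exactly at index n is still found in full.
    i = max(text.rfind(m, 0, n + len(m)) for m in MARKERS)
    if i == -1:
        i = n  # no marker found in the selected text
    return text[:i], text[i:]
```

As in the original snippet, chosen is empty when the only marker found sits at index 0.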