Split a long text into two or more parts, each with a maximum length, in Python

Suppose I have a long text that I want to process with an API that allows a maximum number of characters (N). I would like to split that text into two or more sub-texts, each shorter than N characters, cutting only at a separator. I know I could simply split on the separator, but I would like to keep the number of output sub-texts as small as possible.

For example, suppose my text is:

"Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique. Consulatu cotidieque ex sea, nam no duis prompta expetendis.

Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has."

which is 550 characters long. Let's suppose that N is 250. I would expect the text to be split in this way:

Part 1: "Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique" (237 characters)
Part 2: "Consulatu cotidieque ex sea, nam no duis prompta expetendis.

Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros." (232 characters)

Part 3: the remainder.

Any idea on how to do this in Python?

Thank you for any help. Francesca

CodePudding user response：

You can do that using regex:

import re


output = re.findall(r".{1,250}(?:\.|$)", data)
print(output)

.{1,250}: Matches any character except a newline between 1 and 250 times, as many times as possible.
\.: Matches a dot.
|: Or
$: Matches the end of the string.

You can also put the delimiter and the maximum length in variables:

import re


num_max = 250
delimiter = re.escape('.')

output = re.findall(fr".{{1,{num_max}}}(?:{delimiter}|$)", data)
print(output)

Output:

[
    'Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique.',
    ' Consulatu cotidieque ex sea, nam no duis prompta expetendis.',
    'Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has.'
]
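
Two details of the pattern above are worth keeping in mind. The delimiter is matched after the `.{1,250}` part, so a chunk can be up to 251 characters long, and `.` does not match newlines, which is why the paragraph break always starts a new chunk. The following is only a sketch of one way to tighten both points, not the answerer's code: `split_regex` is a hypothetical helper name, and it assumes the delimiter is shorter than `num_max`. If a single sentence is longer than the limit, `re.findall` silently skips characters, so it is worth checking that the joined chunks equal the input.

```python
import re


def split_regex(data, num_max=250, delimiter="."):
    """Split data into chunks of at most num_max characters.

    Each chunk ends at the delimiter or at the end of the string.
    Hypothetical helper for illustration only.
    """
    sep = re.escape(delimiter)
    # Reserve room for the delimiter so the whole chunk stays within num_max.
    body = num_max - len(delimiter)
    pattern = fr".{{1,{body}}}(?:{sep}|$)"
    # re.DOTALL lets "." match newlines, so a chunk may span paragraphs.
    return re.findall(pattern, data, flags=re.DOTALL)


sample = "aaa. bbb. ccc"
parts = split_regex(sample, num_max=5)
print(parts)
# Sanity check: nothing was dropped and no chunk exceeds the limit.
print("".join(parts) == sample, max(len(p) for p in parts))
```

With `re.DOTALL` the chunks are filled across paragraph breaks, which usually gives fewer chunks than the version without the flag.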

CodePudding user response：


n = 250
text = """Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique. Consulatu cotidieque ex sea, nam no duis prompta expetendis.

Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has."""

if len(text) > n:
  print(text[:n])  # first n characters (indices 0 to n-1)
  print(text[n:])  # everything from index n to the end
else:
  print(text)

So you keep the maximum length in a variable n (250 in your example). The code checks whether the text is longer than n characters. If it is, it prints the first n characters: the slice text[:n] stops just before index n, so no "minus 1" is needed to get 250 rather than 251 characters. It then prints the second part, from index n to the end. Note that this cuts at a fixed position regardless of the separator, and the second part can itself be longer than n.

CodePudding user response：

You can create a function that returns chunks of the desired length.

def split(N, text):
    # Step by N so that consecutive chunks do not overlap.
    chunks = [text[i:i + N] for i in range(0, len(text), N)]
    return chunks

This returns the chunks as a list, e.g.:

text = "Lorem.................." # the complete lorem ipsum text
chunks = split(250, text)
print(len(chunks[0]), len(chunks[1]), len(chunks[2]))

And the output lengths will be

250 250 50
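
Fixed-width slices cut wherever the count runs out, possibly in the middle of a word. If that matters, one option (not part of the answer above) is the standard library's `textwrap.wrap`, which breaks at whitespace and keeps every piece within `width` characters. By default it also replaces newlines with spaces and drops the whitespace at each break, so the pieces do not reassemble into the exact original text. A minimal sketch with a short placeholder string:

```python
import textwrap

# Placeholder text; substitute the full lorem ipsum string.
text = "aaa bbb ccc ddd"

# width plays the role of N: no returned piece is longer than width.
chunks = textwrap.wrap(text, width=7)
print(chunks)
print([len(c) for c in chunks])
```

This breaks at any whitespace rather than at a chosen separator such as a full stop, so it fits best when whole sentences do not need to stay together.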

CodePudding user response：

This is a possible solution:

def split_txt(txt, sep, n):
    if any(len(s) + 1 > n for s in txt.split(sep)):
        raise Exception('The text cannot be split')
    result = []
    start = 0
    while start + n <= len(txt):
        result.append(txt[start:start + n].rsplit(sep, 1)[0] + sep)
        start += len(result[-1])
    if start < len(txt):
        result.append(txt[start:])
    return result
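
The same idea can also be written as an explicit greedy loop, which may be easier to follow: cut the text into separator-terminated pieces, then keep adding pieces to the current chunk until the next one would not fit. Filling each chunk as full as possible before starting a new one should keep the number of chunks as small as the separator positions allow. This is a sketch under assumptions rather than a replacement for the function above: `pack_pieces` is a hypothetical name, and the separator is assumed to be a non-empty string.

```python
def pack_pieces(txt, sep, n):
    """Greedily pack sep-terminated pieces into chunks of at most n characters.

    Hypothetical helper for illustration only.
    """
    # Keep the separator attached to the piece it terminates.
    pieces = [p + sep for p in txt.split(sep)]
    # The last piece was not followed by a separator in the original text.
    pieces[-1] = pieces[-1][:-len(sep)]

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > n:
            raise ValueError("a single piece is longer than n")
        if len(current) + len(piece) > n:
            # The piece does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa. bbb. ccc"
print(pack_pieces(sample, ".", 5))
print(pack_pieces(sample, ".", 10))
```

Because every character of the input ends up in exactly one chunk, joining the result with an empty string gives back the original text, which makes the function easy to test.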