I have a text transcript of dialogue, consisting of strings of variable length. The string lengths can be anywhere from a few characters to thousands of characters.
I want Python to transform the text so that any line is at most n characters. To make the partitioning natural, I want to recursively partition the lines at the last occurrence of any of the delimiters '. ', ', ', '? ', '! ' (punctuation followed by a space). For example, let's assume that the below 72-character string is above a threshold of 36 characters:
This, is a long, long string. It is around(?) 72 characters! Pretty cool
Since the string is longer than 36 characters, the function should recursively partition the string at the last occurrence of any delimiter within the first 36 characters. "Recursively" meaning that if the resulting partitioned strings are still longer than 36 characters, they should also be split according to the same rule. In this case, it should result in a list like:
['This, is a long, long string. ', 'It is around(?) 72 characters! ', 'Pretty cool']
The list items are respectively 30, 31, and 11 characters. None were allowed to be over 36 characters long. Note that the partitions in this example do not occur at a ', ' delimiter, because those weren't the last delimiters within the 36-character threshold.
The partition sequence would've been something like:
'This, is a long, long string. It is around(?) 72 characters! Pretty cool' # 72
['This, is a long, long string. ', 'It is around(?) 72 characters! Pretty cool'] # 30 42
['This, is a long, long string. ', 'It is around(?) 72 characters! ', 'Pretty cool'] # 30 31 11
In the odd situation that there are no delimiters in the string or the resulting recursive partitions, the strings should be wrapped using something like textwrap.wrap() to a max of 36 characters, which in the absence of delimiters would produce a list like:
['There are no delimiters here so I am', 'partitioned at 36 characters'] # 36 28
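For reference, the fallback behaviour can be checked directly; note that textwrap.wrap() strips the whitespace it breaks on, so chunks never start or end with a space:

```python
import textwrap

# Demonstration of the delimiter-free fallback described above:
# textwrap.wrap() breaks on whitespace, drops it, and never emits
# a chunk longer than the given width.
s = "There are no delimiters here so I am partitioned at 36 characters"
print(textwrap.wrap(s, width=36))
# → ['There are no delimiters here so I am', 'partitioned at 36 characters']
```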
I've tried to work out a Python function to achieve this, but it has been difficult. I spent a long time in ChatGPT and couldn't get it to work despite many prompts.
Is there a Python module function that can achieve this already, or alternatively, can you suggest a function that will solve this problem?
I am attaching two of the ChatGPT attempts below for reference. Unfortunately, they do not work: if the line is above the threshold of 36 characters, they split the line at every occurrence of a delimiter instead of at the last delimiter closest to the 36-character limit. I wasn't able to resolve the issue, but I'm providing the code below in case it gives you any ideas. MAX_COUNT was included to prevent an endless recursion loop, but I think it's superfluous if one adds a textwrap.wrap() fallback for situations when there are no delimiters.
import re

line = "This is a very long line of text that goes on and on and on and on. It contains a lot of words and sentences, and it is quite difficult to read. However, despite its length, it is still quite interesting and engaging! Or is it?"

adjusted_lines = []

def split_line(line, count=0):
    split_lines = []
    MAX_COUNT = 1000
    if count < MAX_COUNT:
        if len(line) > 36:
            # Search the reversed string for a delimiter character that
            # has whitespace within the following 32 characters.
            match = re.search(r'[.,?!](?=(.{0,31}\s))', line[::-1])
            if match:
                left = line[-match.start()-1:]
                right = line[:-match.start()-1]
                split_lines = [left] + split_line(right, count + 1)
            else:
                split_lines.append(line)
        else:
            split_lines.append(line)
    else:
        split_lines.append(line)
    return split_lines

adjusted_lines.extend(split_line(line))
print(adjusted_lines)
Another attempt is wrong in the same way: if the line is above the 36-character threshold, it partitions the line at every occurrence of a delimiter instead of at the last delimiter closest to the 36-character limit:
import textwrap

line = "This is a very long line of text that goes on and on and on and on. It contains a lot of words and sentences, and it is quite difficult to read. However, despite its length, it is still quite interesting and engaging! Or is it?"

adjusted_lines = []

def partition_string(s):
    partitions = []
    if len(s) <= 36:
        partitions.append(s)
        return partitions
    # Find the last occurrence of any delimiter anywhere in the string
    # (not restricted to the first 36 characters).
    index = -1
    delimiter = ""
    for d in [". ", ", ", "? ", "! "]:
        last_index = s.rfind(d)
        if last_index != -1:
            if last_index > index:
                index = last_index
                delimiter = d
    if index != -1:
        left_part = s[:index + len(delimiter)].rstrip()
        right_part = s[index + len(delimiter):]
        partitions.extend(partition_string(left_part))
        partitions.extend(partition_string(right_part))
    else:
        partitions.extend(textwrap.wrap(s, width=36))
    return partitions

adjusted_lines.extend(partition_string(line))
print(adjusted_lines)
NB: Character count online tool: https://www.charactercountonline.com/
CodePudding user response:
You can use rfind to get the last occurrence of a delimiter in the first n characters of a string.
def partition(s, n):
    if len(s) <= n: return [s]
    idx = max(s.rfind(c, 0, n) for c in ['.', ',', '?', '!'])
    return [s] if idx == -1 else [s[0:idx + 2], *partition(s[idx + 2:], n)]
print(partition('This, is a long, long string. It is around(?) 72 characters! Pretty cool', 36))
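The version above splits on single-character delimiters and doesn't cover the question's no-delimiter case. A minimal sketch extending the same rfind idea with the question's two-character delimiters and a textwrap.wrap() fallback (the function name and structure here are illustrative, not part of the original answer):

```python
import textwrap

def partition(s, n):
    """Split s into chunks of at most n characters, cutting at the
    last delimiter whose trailing space still fits in the first n
    characters; fall back to textwrap.wrap() when none is found."""
    if len(s) <= n:
        return [s]
    # rfind(d, 0, n) returns the last occurrence of the two-character
    # delimiter d that lies entirely within s[0:n], or -1 if absent.
    idx = max(s.rfind(d, 0, n) for d in ['. ', ', ', '? ', '! '])
    if idx == -1:
        # No delimiter in range: wrap on whitespace instead.
        return textwrap.wrap(s, width=n)
    # Keep the delimiter (and its trailing space) with the left chunk.
    return [s[:idx + 2]] + partition(s[idx + 2:], n)

print(partition('This, is a long, long string. It is around(?) 72 characters! Pretty cool', 36))
# → ['This, is a long, long string. ', 'It is around(?) 72 characters! ', 'Pretty cool']
```

Because each chunk ends no later than index n, every list item stays within the limit, and the delimiter-free case degrades to plain word wrapping.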