If an API accepts a string value with a limit on the number of bytes, but also accepts Unicode, is there a better way to shorten the string so that it stays valid Unicode?
def truncate(string: str, length: int) -> str:
    """Shorten a Unicode string to at most `length` bytes of UTF-8."""
    if len(string.encode()) <= length:
        return string
    chars = list(string)
    # Drop trailing characters until the encoded size fits.
    while sum(len(char.encode()) for char in chars) > length:
        chars.pop()
    return "".join(chars)
CodePudding user response:
This should work:
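def truncate(string: str, length: int) -> str:
    bytes_ = string.encode()
    try:
        return bytes_[:length].decode()
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded,
        # i.e. the start of the character that was cut in half.
        return bytes_[:err.start].decode()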
Basically, we truncate at the first decoding error. UTF-8 is a prefix code, so the decoder can always tell when the string has been cut in the middle of a character. Weirdness may still occur with accents and other combining characters: a cut between a base character and its combining mark leaves valid UTF-8 but changes how the text renders. I have not thought this one through. Maybe we need some normalization (NFC, for example), too.
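To make the combining-character caveat concrete, here is a rough sketch, not a definitive solution. The names truncate_utf8 and truncate_keep_marks are made up for this illustration. It assumes the API counts UTF-8 bytes and that NFC-normalizing the input is acceptable (normalization can change the code points that are sent). It only looks at combining marks via unicodedata.combining, so it does not handle full grapheme clusters such as emoji joined with ZWJ.

```python
import unicodedata


def truncate_utf8(string: str, length: int) -> str:
    """Shorten to at most `length` UTF-8 bytes without splitting a code point."""
    # errors="ignore" drops the incomplete trailing sequence left by the slice.
    return string.encode("utf-8")[:length].decode("utf-8", errors="ignore")


def truncate_keep_marks(string: str, length: int) -> str:
    """Like truncate_utf8, but do not separate a base character from its marks."""
    # Assumption: NFC is acceptable; it composes e.g. "e" + U+0301 into U+00E9.
    normalized = unicodedata.normalize("NFC", string)
    result = truncate_utf8(normalized, length)
    if len(result) == len(normalized):
        return result  # nothing was cut
    # If the first dropped character is a combining mark, the cut fell inside
    # a base+marks sequence, so drop the rest of that sequence as well.
    if unicodedata.combining(normalized[len(result)]):
        while result and unicodedata.combining(result[-1]):
            result = result[:-1]
        result = result[:-1]  # the base character itself
    return result
```

For example, "x" followed by U+0301 has no precomposed form, so truncate_utf8("ax\u0301", 2) returns "ax" (the accent is lost), while truncate_keep_marks("ax\u0301", 2) returns "a".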