Is there a Pythonic way of truncating a Unicode string by a maximum number of bytes?

Time:12-10

If an API accepts a string value with a limit on the number of bytes, but accepts Unicode, is there a better way to shorten the string so that it remains valid Unicode?

def truncate(string: str, length: int):
    """Shorten an Unicode string to a certain length of bytes."""
    if len(string.encode()) <= length:
        return string

    chars = list(string)
    while sum(len(char.encode()) for char in chars) > length:
        chars.pop(-1)

    return "".join(chars)

CodePudding user response:

This should work:

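def truncate(string: str, length: int) -> str:
    """Shorten a Unicode string to at most `length` UTF-8 bytes."""
    bytes_ = string.encode()
    try:
        return bytes_[:length].decode()
    except UnicodeDecodeError as err:
        # err.start is the offset of the incomplete trailing character
        return bytes_[:err.start].decode()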

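Basically, we truncate at the first decoding error. UTF-8 is a prefix code, so the decoder can always tell when the byte string has been cut in the middle of a character: the error's start attribute points at the first byte of the incomplete sequence, and everything before it decodes cleanly. Weirdness may still occur with accents and other combining characters, because a cut can land between a base letter and its combining mark and silently drop the accent. I have not thought this one through. Maybe we need some normalization (for example NFC), too.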
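
As a rough sketch of the two points above, and not a definitive implementation: the same truncation can be written with errors="ignore", which discards the incomplete trailing bytes instead of catching the exception, and an NFC pass beforehand may keep some accented letters intact. The names truncate_utf8 and truncate_nfc are made up for this illustration, and the sketch assumes the API counts UTF-8 bytes.

```python
import unicodedata


def truncate_utf8(string: str, length: int) -> str:
    """Longest prefix of `string` whose UTF-8 encoding fits in `length` bytes."""
    # The input is valid UTF-8, so the only undecodable bytes after slicing
    # are an incomplete sequence at the very end; "ignore" drops them.
    return string.encode("utf-8")[:length].decode("utf-8", errors="ignore")


def truncate_nfc(string: str, length: int) -> str:
    """Same, but compose base letters and combining marks (NFC) first."""
    return truncate_utf8(unicodedata.normalize("NFC", string), length)


decomposed = "cafe\u0301"  # 'e' + COMBINING ACUTE ACCENT: 6 bytes in UTF-8
print(truncate_utf8(decomposed, 5))  # the accent is cut off, leaving "cafe"
print(truncate_nfc(decomposed, 5))   # NFC form "caf\u00e9" is 5 bytes and fits
```

NFC only helps where a precomposed character exists. Sequences with no composed form (many emoji sequences, for instance) can still be split between code points, so this reduces the weirdness without removing it.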
