In Swift, how can I find the byte length of a UTF8 string?

Note: This is only an issue if you have used UTF8 encoding inside a serialized buffer, or you have a sequence of UTF8 strings. If you have a single UTF8 encoded string inside a 23-byte array, then you already know the UTF8 encoding is 23 bytes or less, and you probably don't care if there are extra bytes at the end.

I have a Data object with multiple encoded UTF8 strings, end-to-end. I can convert the first string to a String object by passing the Data object to a String initializer. But then I want to know how many bytes were consumed, so that I can convert the next chunk of UTF8 bytes.

For example:

let str = String(decoding: rawdata!, as: UTF8.self)

I want to know how many bytes in rawdata were consumed.

Solution 1: Convert the resulting string back into a UTF8View and count the bytes.

eg:

str.utf8.count

Solution 2: Insert a 1-byte String containing a NUL character between the UTF8 encoded strings, and detect it after converting the bytes back to String objects. I assume this works with Swift Strings, and my use case guarantees that my encoded strings will never contain that value.

I would prefer a way to convert the initial part of a UTF8 array to a String object and also get back the number of bytes that was consumed by the conversion.

Is there a way to do this, or better work around?

Update: #1 is the solution I am leaning towards. I don't know if the String implementation recalculates the UTF8 encoding in order to count the bytes. It seems more efficient to remember the count when doing the conversion of UTF8 to String.

CodePudding user response：

Swift strings can contain NUL bytes (e.g. "Hello\u{0000}world!" is a valid String), so assuming your strings are terminated with NUL bytes, neither of your approaches would be sufficient.
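
To illustrate the point above, here is a minimal sketch (the sample bytes and the names `joined` and `whole` are made up for illustration): decoding a buffer that contains a NUL byte does not stop at the NUL, so the whole buffer ends up in one String.

```swift
import Foundation

// Hypothetical sample: two strings joined by a single NUL byte.
let joined = Data("first\u{00}second".utf8)

// Decoding does not stop at the NUL; the entire buffer becomes one String.
let whole = String(decoding: joined, as: UTF8.self)

print(whole.utf8.count == joined.count) // true: all 12 bytes were decoded
print(whole.utf8.contains(0))           // true: the NUL is part of the String
```

So in this case counting the bytes of the resulting string only tells you the size of the whole buffer, not where the first string ended.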

Instead, you'll likely want to go with the approach that @Larme posted as a comment: split the data up first, and create strings from those slices.

If your separator is indeed a NUL byte, this can be as simple as

import Foundation

func decode(_ data: Data, separator: UInt8) -> [String] {
    data.split(separator: separator).map { String(decoding: $0, as: UTF8.self) }
}

let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)

print(decode(data, separator: 0x00))
// => ["Hello, world!", "Following string.", "And another one!"]

The split(separator:) method here is Collection.split(separator:maxSplits:omittingEmptySubsequences:), which takes a separator of a single Element (in this case, a single UInt8) and returns slices of the original data. Because omittingEmptySubsequences defaults to true, this will work even if your separator is N NUL bytes in a row (since you'll get N - 1 empty splits, all of which will be thrown away).
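
As a small sketch of that behaviour (the sample data and the names `sample`, `omitted` and `kept` are made up for illustration), here is the same split with and without the empty subsequences:

```swift
import Foundation

// Hypothetical sample with two NUL bytes in a row between "a" and "b".
let sample = Data("a\u{00}\u{00}b".utf8)

// Default: empty subsequences are dropped.
let omitted = sample.split(separator: 0x00)
    .map { String(decoding: $0, as: UTF8.self) }

// Keeping them shows the N - 1 = 1 empty split between the two NULs.
let kept = sample.split(separator: 0x00, omittingEmptySubsequences: false)
    .map { String(decoding: $0, as: UTF8.self) }

print(omitted) // ["a", "b"]
print(kept)    // ["a", "", "b"]
```

If empty strings are meaningful in your format, pass omittingEmptySubsequences: false and handle them yourself.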

If your separator is more complicated, there isn't quite a similar convenience method, but you can still use Data methods to find instances of your separator sequence and split around them yourself:

import Foundation

func decode(_ data: Data, separator: String) -> [String] {
    // `firstRange(of:)` below takes a type conforming to `DataProtocol`.
    // `String.UTF8View` doesn't conform, but `Array` does. This copy should
    // be cheap if the separator is small.
    let separatorBytes = Array(separator.utf8)
    var strings = [String]()
    
    // Slicing the data will give cheap no-copy views into it.
    // This first slice is the full data blob.
    var slice = data[...]

    // As long as there's an instance of `separator` in the data...
    while let separatorRange = slice.firstRange(of: separatorBytes) {
        // ... pull out all of the bytes before it into a String...
        strings.append(String(decoding: slice[..<separatorRange.lowerBound], as: UTF8.self))

        // ... and skip past the separator to keep looking for more.
        slice = slice[separatorRange.upperBound...]
    }
    
    // If there are no separators in the data, or the last string is not
    // terminated with a separator itself, pull out the remaining contents.
    if !slice.isEmpty {
        strings.append(String(decoding: slice, as: UTF8.self))
    }
    
    return strings
}

let separator = "\u{00}\u{20}\u{00}"
let data = Data("Hello, world!\(separator)Following string.\(separator)And another one!".utf8)
print(decode(data, separator: separator))
// => ["Hello, world!", "Following string.", "And another one!"]

CodePudding user response：

Try this solution

str.utf8.count
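
A quick sketch of what this gives you, with one caveat (the sample bytes and variable names are made up for illustration): for valid UTF8 input the count matches the number of bytes decoded, but String(decoding:as:) repairs invalid bytes by replacing them with U+FFFD, which is 3 bytes in UTF8, so the count can differ from the input length.

```swift
import Foundation

// Valid input: "h\u{E9}llo" is 5 characters but 6 UTF-8 bytes.
let valid = Data("h\u{E9}llo".utf8)
let decodedValid = String(decoding: valid, as: UTF8.self)
print(decodedValid.count, decodedValid.utf8.count, valid.count) // 5 6 6

// Invalid input: 0xFF never appears in valid UTF-8, so it is repaired
// to U+FFFD, which encodes as 3 bytes.
let invalid = Data([0x61, 0xFF])
let repaired = String(decoding: invalid, as: UTF8.self)
print(repaired.utf8.count, invalid.count) // 4 2
```

So this only works as a "bytes consumed" measure if the input is known to be valid UTF8, and if the String was built from exactly the bytes of one string rather than the whole buffer.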