In Swift, how can I find the byte length of a UTF-8 string within a larger buffer?

I have a Data object containing UTF-8 encoded strings mixed with other types of serialized values. The first version of this question assumed that UTF-8 had a built-in string terminator, but it doesn't.

A chunk of UTF-8 bytes has the same problem as a chunk of ASCII bytes: nothing in the encoding marks where the string ends. The length must be handled either by storing it explicitly somewhere, or by using a terminator (like NUL/0).

If you use a terminator, you have to restrict the string contents so that they never contain the terminator value. The code then cannot encode every legal Swift String, but that may be acceptable depending on the application.
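For comparison, here is a minimal sketch of the other option, storing the length explicitly. The 4-byte big-endian prefix and the helper names appendLengthPrefixed and readLengthPrefixed are assumptions made for illustration, not part of any existing format or API. Because the length is stored, the string may contain NUL bytes.

```swift
import Foundation

// Hypothetical helper: append a string as a 4-byte big-endian byte count
// followed by its UTF-8 bytes.
func appendLengthPrefixed(_ string: String, to buffer: inout Data) {
    let bytes = Array(string.utf8)
    let length = UInt32(bytes.count)
    for shift in [24, 16, 8, 0] {
        buffer.append(UInt8(truncatingIfNeeded: length >> shift))
    }
    buffer.append(contentsOf: bytes)
}

// Hypothetical helper: read one length-prefixed string starting at `pos`
// and advance `pos` past it. Returns nil if the buffer is too short.
func readLengthPrefixed(from buffer: Data, at pos: inout Int) -> String? {
    guard pos >= buffer.startIndex, buffer.endIndex - pos >= 4 else { return nil }
    var length = 0
    for byte in buffer[pos..<pos + 4] {
        length = (length << 8) | Int(byte)
    }
    let start = pos + 4
    guard buffer.endIndex - start >= length else { return nil }
    let string = String(decoding: buffer[start..<start + length], as: UTF8.self)
    pos = start + length
    return string
}

var buffer = Data()
appendLengthPrefixed("Hello\u{0000}world!", to: &buffer)
appendLengthPrefixed("", to: &buffer)

var pos = buffer.startIndex
let first = readLengthPrefixed(from: buffer, at: &pos)
let second = readLengthPrefixed(from: buffer, at: &pos)
```

The trade-off is four extra bytes per string in exchange for no restriction on the string contents.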

Here is the code I ended up with:

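import Foundation

let buffer: Data = ...
var pos: Int = ...  // must be `var`, since it is advanced below
let separator = UInt8(0x00)
// Split off only the first NUL-terminated chunk at `pos`, keeping empty
// chunks so that a NUL at `pos` yields an empty string.
let s = buffer[pos...].split(
    separator: separator,
    maxSplits: 1,
    omittingEmptySubsequences: false).map {
        data in
        String(decoding: data, as: UTF8.self)
    }[0]
// Advance past the string's bytes and its terminator. This assumes the bytes
// were valid UTF-8; String(decoding:as:) repairs invalid sequences, which
// would change the count.
pos += s.utf8.count + 1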

Note: If the byte at the current location is a zero, the string there is empty. It's important to pass omittingEmptySubsequences: false so that an empty string is returned instead of the empty chunk being skipped.
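To get the byte length directly, without going through the decoded string, one option is to search for the terminator with firstIndex(of:). This is only a sketch: the helper name readTerminated and its tuple return type are made up for illustration, and it treats a missing terminator as "the string runs to the end of the buffer".

```swift
import Foundation

// Hypothetical helper: decode the NUL-terminated string that starts at `pos`
// and report how many bytes it occupies, not counting the terminator.
func readTerminated(from buffer: Data, at pos: Int) -> (string: String, byteLength: Int) {
    let rest = buffer[pos...]
    // Assumption: if no terminator is found, the string runs to the end.
    let end = rest.firstIndex(of: 0x00) ?? rest.endIndex
    let bytes = rest[..<end]
    return (String(decoding: bytes, as: UTF8.self), bytes.count)
}

// "h\u{E9}llo" is 5 characters but 6 bytes, followed by an empty string.
let buffer = Data("h\u{E9}llo\u{00}\u{00}next".utf8)
var pos = buffer.startIndex

let first = readTerminated(from: buffer, at: pos)
pos += first.byteLength + 1   // skip the string and its terminator

let second = readTerminated(from: buffer, at: pos)
pos += second.byteLength + 1

print(first.byteLength, second.byteLength)  // 6 0
```

Counting the raw bytes rather than s.utf8.count keeps the position correct even if the buffer contains invalid UTF-8.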

Note: Be careful with Data.suffix(from:). Its argument is an index, not an offset, and a Data slice keeps the indices of the Data it was sliced from instead of starting again at 0. For example:

let data: Data = ...
let d1 = data.suffix(from: 10)
let d2 = d1.suffix(from: 10)
// d1 and d2 contain the same bytes, because index 10 is the same position in both.

That is why I chose to use the approach of keeping an integer location variable.
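A small runnable check of that behaviour, using a made-up 30-byte buffer so the positions are easy to see:

```swift
import Foundation

// Sample data: bytes 0, 1, 2, ... 29, so each byte equals its own index.
let data = Data((0..<30).map { UInt8($0) })
let d1 = data.suffix(from: 10)
let d2 = d1.suffix(from: 10)

// The slice keeps the indices of the data it came from.
print(d1.startIndex)   // 10
print(d1 == d2)        // true

// To skip 10 more bytes relative to the slice, use an offset-based call.
let d3 = d1.dropFirst(10)
print(d3.startIndex)   // 20
```

dropFirst(_:) counts elements rather than taking an index, so it behaves the same on a slice as on the original Data.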

CodePudding user response：

Swift strings can contain NUL bytes (e.g. "Hello\u{0000}world!" is a valid String), so assuming your strings are terminated with NUL bytes, neither of your approaches would be sufficient.
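A quick illustration of the problem, using the same example string: the NUL scalar is encoded as a single 0x00 byte, so splitting the encoded bytes on 0x00 cuts one string into two.

```swift
let string = "Hello\u{0000}world!"
let bytes = Array(string.utf8)

// The embedded NUL becomes a 0x00 byte in the middle of the encoded string.
let containsNUL = bytes.contains(0x00)

// Splitting on 0x00 therefore breaks the single string into two pieces.
let pieces = bytes.split(separator: 0x00).map { String(decoding: $0, as: UTF8.self) }

print(containsNUL)  // true
print(pieces)       // ["Hello", "world!"]
```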

Instead, you'll likely want to go with the approach that @Larme posted as a comment: split the data up first, and create strings from those slices.

If your separator is indeed a NUL byte, this can be as simple as

import Foundation

func decode(_ data: Data, separator: UInt8) -> [String] {
    data.split(separator: separator).map { String(decoding: $0, as: UTF8.self) }
}

let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)

print(decode(data, separator: 0x00))
// => ["Hello, world!", "Following string.", "And another one!"]

The split(separator:) method here is Sequence.split(separator:maxSplits:omittingEmptySubsequences:), which takes a separator of a single Sequence.Element; in this case, a single UInt8. omittingEmptySubsequences defaults to true, so:

1. If empty strings are valid inputs and you need to process them, make sure to pass in false. Otherwise,
2. If your separator is N NUL bytes in a row, this method will still work for you: you'll get N - 1 empty splits, all of which will be thrown away.
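To make the difference concrete, here is the same made-up buffer split both ways. It contains two NUL bytes in a row and a trailing NUL.

```swift
import Foundation

// Bytes: "a", NUL, NUL, "b", NUL
let data = Data("a\u{00}\u{00}b\u{00}".utf8)

// Default behaviour: empty subsequences are dropped.
let omitting = data.split(separator: 0x00)
    .map { String(decoding: $0, as: UTF8.self) }

// Keeping empty subsequences: every NUL ends a (possibly empty) string,
// and the trailing NUL produces a final empty piece.
let keeping = data.split(separator: 0x00, omittingEmptySubsequences: false)
    .map { String(decoding: $0, as: UTF8.self) }

print(omitting)  // ["a", "b"]
print(keeping)   // ["a", "", "b", ""]
```

Note the trailing empty string in the second result: if every string in the buffer is terminated, you may want to drop that last element.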

Alternatively, if you don't want to eagerly split up the entire buffer up-front (e.g. you may be looking for a sentinel value which indicates to stop processing), you can split the buffer piecemeal by looping over the buffer and grabbing prefixes terminated by the separator using Data.prefix(while:):

import Foundation

func process(_ data: Data, separator: UInt8, using action: (String) -> Bool) {
    var slice = data[...]
    while !slice.isEmpty {
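        // Take the bytes up to (but not including) the next separator.
        let prefix = slice.prefix(while: { $0 != separator })
        let substring = String(decoding: prefix, as: UTF8.self)
        if !action(substring) {
            break
        }

        // Skip the bytes just consumed plus the separator itself. Counting the
        // raw bytes stays correct even if decoding repaired invalid UTF-8.
        slice = slice.dropFirst(prefix.count + 1)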
    }
}

let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)
process(data, separator: 0x00) { string in
    print(string)
    return true // continue
}

If your separator is more complicated (e.g., multiple different characters long), you can still use Data methods to find instances of your separator sequence and split around them yourself:

import Foundation

func decode(_ data: Data, separator: String) -> [String] {
    // `firstRange(of:)` below takes a type conforming to `DataProtocol`.
    // `String.UTF8View` doesn't conform, but `Array` does. This copy should
    // be cheap if the separator is small.
    let separatorBytes = Array(separator.utf8)
    var strings = [String]()
    
    // Slicing the data will give cheap no-copy views into it.
    // This first slice is the full data blob.
    var slice = data[...]

    // As long as there's an instance of `separator` in the data...
    while let separatorRange = slice.firstRange(of: separatorBytes) {
        // ... pull out all of the bytes before it into a String...
        strings.append(String(decoding: slice[..<separatorRange.lowerBound], as: UTF8.self))

        // ... and skip past the separator to keep looking for more.
        slice = slice[separatorRange.upperBound...]
    }
    
    // If there are no separators in the data, or the last string is not
    // terminated with a separator itself, pull out the remaining contents.
    if !slice.isEmpty {
        strings.append(String(decoding: slice, as: UTF8.self))
    }
    
    return strings
}

let separator = "\u{00}\u{20}\u{00}"
let data = Data("Hello, world!\(separator)Following string.\(separator)And another one!".utf8)
print(decode(data, separator: separator))
// => ["Hello, world!", "Following string.", "And another one!"]

CodePudding user response：

Try this solution

str.utf8.count
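For reference, a small example of what that property reports. It gives the number of UTF-8 bytes in a String you have already decoded, which can differ from the character count. It does not locate the end of a string inside a buffer. The sample string is made up for illustration.

```swift
// "é" (U+00E9) takes 2 bytes in UTF-8 and the globe emoji (U+1F30D) takes 4.
let str = "h\u{E9}llo \u{1F30D}"

print(str.count)       // 7 characters
print(str.utf8.count)  // 11 bytes
```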