I have a Data object with encoded UTF-8 strings and other types of serialized values. The first version of this question assumed that UTF-8 had a built-in string termination, but it doesn't.
A chunk of UTF-8 characters has the same issues as a chunk of ASCII bytes. The length of the string must be handled either by explicitly storing the length somewhere, or by using a terminator (like NUL/0).
If you use a terminator, then you have to restrict the string contents so they do not contain the terminator value. That makes the code unsuitable for encoding all legal Swift Strings, but that may be okay depending on the application.
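For comparison, the explicit-length option could look roughly like the sketch below. The helper names `appendString` and `readString` and the 4-byte big-endian length prefix are assumptions made for illustration, not part of the original format. Because the length is stored, the string may contain NUL bytes.

```swift
import Foundation

// Hypothetical helper: writes a 4-byte big-endian byte count, then the UTF-8 bytes.
func appendString(_ string: String, to buffer: inout Data) {
    let utf8 = Array(string.utf8)
    let count = UInt32(utf8.count)
    let header: [UInt8] = [
        UInt8(truncatingIfNeeded: count >> 24),
        UInt8(truncatingIfNeeded: count >> 16),
        UInt8(truncatingIfNeeded: count >> 8),
        UInt8(truncatingIfNeeded: count)
    ]
    buffer.append(contentsOf: header)
    buffer.append(contentsOf: utf8)
}

// Hypothetical helper: reads one length-prefixed string.
// `position` is an offset from `buffer.startIndex` and is advanced on success.
func readString(from buffer: Data, at position: inout Int) -> String? {
    guard position >= 0 else { return nil }
    let start = buffer.startIndex + position
    // Need at least 4 bytes for the length header.
    guard start <= buffer.endIndex, buffer.endIndex - start >= 4 else { return nil }
    let length = buffer[start..<start + 4].reduce(Int(0)) { ($0 << 8) | Int($1) }
    let bodyStart = start + 4
    // The declared length must fit in the remaining bytes.
    guard buffer.endIndex - bodyStart >= length else { return nil }
    let string = String(decoding: buffer[bodyStart..<bodyStart + length], as: UTF8.self)
    position += 4 + length
    return string
}
```

The trade-off is four extra bytes per string in exchange for no restriction on the string contents.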
Here is the code I ended up with:
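let buffer: Data = ...
var pos: Int = ...
let separator = UInt8(0x00)
// Split off only the first NUL-terminated chunk at `pos`, keeping empty chunks.
let s = buffer[pos...].split(
    separator: separator,
    maxSplits: 1,
    omittingEmptySubsequences: false).map {
        data in
        String(decoding: data, as: UTF8.self)
    }[0]
// Advance past the string's bytes and its terminator.
pos += s.utf8.count + 1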
Note: If the buffer has a single zero byte at the current location, passing omittingEmptySubsequences: false is what makes split return an empty string there instead of skipping it.
Note: Be careful when using Data.suffix(from:). Its argument is an index, not an offset, and a Data slice keeps the indices of the data it was sliced from. For example:
let data: Data = ...
let d1 = data.suffix(from: 10)
let d2 = d1.suffix(from: 10)
// d1 and d2 will have the same data.
That is why I chose to use the approach of keeping an integer location variable.
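A small sketch of that index behavior, using made-up sample bytes (the names `whole`, `d1`, `d2`, and `d3` are only for illustration). It also shows that `dropFirst(_:)` counts elements rather than indices, so it does move relative to the slice:

```swift
import Foundation

// Sample data whose byte values equal their positions: 0, 1, ..., 29.
let bytes: [UInt8] = Array(0..<30)
let whole = Data(bytes)

let d1 = whole.suffix(from: 10)   // covers indices 10..<30
let d2 = d1.suffix(from: 10)      // index 10 again, so the same bytes as d1
let d3 = d1.dropFirst(10)         // skips 10 elements of d1, covers 20..<30

print(d1.startIndex, d2.startIndex, d3.startIndex) // 10 10 20
print(d1 == d2)                                    // true
```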
CodePudding user response:
Swift strings can contain NUL bytes (e.g. "Hello\u{0000}world!" is a valid String), so assuming your strings are terminated with NUL bytes, neither of your approaches would be sufficient.
Instead, you'll likely want to go with the approach that @Larme posted as a comment: split the data up first, and create strings from those slices.
If your separator is indeed a NUL byte, this can be as simple as
import Foundation
func decode(_ data: Data, separator: UInt8) -> [String] {
    data.split(separator: separator).map { String(decoding: $0, as: UTF8.self) }
}
let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)
print(decode(data, separator: 0x00))
// => ["Hello, world!", "Following string.", "And another one!"]
The split(separator:) method here is Sequence.split(separator:maxSplits:omittingEmptySubsequences:), which takes a separator of a single Sequence.Element, in this case a single UInt8. omittingEmptySubsequences defaults to true, so:
- If empty strings are valid inputs and you need to process them, make sure to pass in false.
- Otherwise, if your separator is N NUL bytes in a row, this method will still work for you: you'll get N - 1 empty splits, all of which will be thrown away.
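To make the difference concrete, here is a minimal sketch comparing both settings on the same input. The function `decodeChunks` and its `keepEmpty` flag are names invented for this example.

```swift
import Foundation

// Decodes NUL-separated UTF-8 chunks. `keepEmpty` is a hypothetical flag
// that simply inverts `omittingEmptySubsequences`.
func decodeChunks(_ data: Data, keepEmpty: Bool) -> [String] {
    data.split(separator: 0x00, omittingEmptySubsequences: !keepEmpty)
        .map { String(decoding: $0, as: UTF8.self) }
}

// "a", two NULs in a row, "b", and a trailing NUL.
let sample = Data("a\u{00}\u{00}b\u{00}".utf8)

print(decodeChunks(sample, keepEmpty: false)) // ["a", "b"]
print(decodeChunks(sample, keepEmpty: true))  // ["a", "", "b", ""]
```

Note that with empty subsequences kept, a trailing separator also produces a final empty string.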
Alternatively, if you don't want to eagerly split up the entire buffer up-front (e.g. you may be looking for a sentinel value which indicates to stop processing), you can split the buffer piecemeal by looping over the buffer and grabbing prefixes terminated by the separator using Data.prefix(while:):
import Foundation
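func process(_ data: Data, separator: UInt8, using action: (String) -> Bool) {
    var slice = data[...]
    while !slice.isEmpty {
        // Bytes up to (not including) the next separator.
        let prefix = slice.prefix(while: { $0 != separator })
        let substring = String(decoding: prefix, as: UTF8.self)
        if !action(substring) {
            break
        }
        // Skip the consumed bytes plus the separator. Counting raw bytes stays
        // correct even if invalid UTF-8 was replaced during decoding.
        slice = slice.dropFirst(prefix.count + 1)
    }
}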
let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)
process(data, separator: 0x00) { string in
print(string)
return true // continue
}
If your separator is more complicated (e.g., multiple different characters long), you can still use Data methods to find instances of your separator sequence and split around them yourself:
import Foundation
func decode(_ data: Data, separator: String) -> [String] {
    // `firstRange(of:)` below takes a type conforming to `DataProtocol`.
    // `String.UTF8View` doesn't conform, but `Array` does. This copy should
    // be cheap if the separator is small.
    let separatorBytes = Array(separator.utf8)
    var strings = [String]()
    // Slicing the data will give cheap no-copy views into it.
    // This first slice is the full data blob.
    var slice = data[...]
    // As long as there's an instance of `separator` in the data...
    while let separatorRange = slice.firstRange(of: separatorBytes) {
        // ... pull out all of the bytes before it into a String...
        strings.append(String(decoding: slice[..<separatorRange.lowerBound], as: UTF8.self))
        // ... and skip past the separator to keep looking for more.
        slice = slice[separatorRange.upperBound...]
    }
    // If there are no separators in the data, or the last string is not
    // terminated with a separator itself, pull out the remaining contents.
    if !slice.isEmpty {
        strings.append(String(decoding: slice, as: UTF8.self))
    }
    return strings
}
let separator = "\u{00}\u{20}\u{00}"
let data = Data("Hello, world!\(separator)Following string.\(separator)And another one!".utf8)
print(decode(data, separator: separator))
// => ["Hello, world!", "Following string.", "And another one!"]
CodePudding user response:
Try this solution
str.utf8.count